Abstract. Recent research shows that copying is prevalent
for Deep-Web data and considering copying can significantly 
improve truth finding from conflicting values. However, existing 
copy detection techniques do not scale for large sizes and numbers 
of data sources, so truth finding can be slowed down by one 
to two orders of magnitude compared with the corresponding 
techniques that do not consider copying. In this paper, we study 
how to improve scalability of copy detection on structured data. 

Our algorithm builds an inverted index for each shared value 
and processes the index entries in decreasing order of how much 
the shared value can contribute to the conclusion of copying. We 
show how we use the index to prune the data items we consider 
for each pair of sources, and to incrementally refine our results 
in iterative copy detection. We also apply a sampling strategy
that further reduces copy-detection time
while still obtaining results very similar to those on the whole data set.
Experiments on various real data sets show that our algorithm 
can reduce the time for copy detection by two to three orders of 
magnitude; in other words, truth finding can benefit from copy 
detection with very little overhead. 

I. Introduction 

As we enjoy the abundance of information on the Web, we 
are often confused and misguided by low-quality data, which 
can be out-of-date, incomplete, or erroneous. Recently, Li et
al. [12] showed that even for domains such as stock and flight,
conflicting values are provided by different Deep Web sources 
on 70% of the data items (e.g., closing price of a stock). 
In addition, although well-known authoritative sources, such 
as NASDAQ for stock and Orbitz for flight, often have fairly 
high accuracy, they may not have the desired coverage. Many 
applications, such as integrating Web-scale data and building 
knowledge bases from the Web, call for advanced data fusion 
techniques to resolve conflicts from different sources and 
identify values that reflect the real world. 

Typically, we expect that the values provided by many 
sources are likely to be true. Unfortunately, data copying is 
common on the Web: a false value can spread through copying 
and become quite popular. Thus, we need to detect copying 
and discount data from copiers for truth finding. Most of 
current copy-detection techniques on structured data 0, 0, 
0, d share two features. First, they exploit the intuition that 
copying is likely if many false values are shared, since this 
is unlikely to happen between independent sources. Second, 
since the truthfulness of data is often unknown a priori, 
they iteratively conduct copy detection, truth finding, and 
source-accuracy computation until convergence. In [12], the
authors show that the copy-detection techniques of 0 can
significantly improve truth-finding results in the presence of
copied values and fix half of the errors made by naive voting or
considering only source accuracy. Detecting copying between 
Web sources is also important to finding the truths in building 
knowledge bases (9), which are widely used in Web search. 
Copy detection is also valuable in studying dissemination of 
information, protecting the rights of data providers, and so on. 

Despite its importance, research on copy detection for 
structured data is still in its infancy, focusing mainly on 
effectiveness of the techniques. Current techniques examine 
every shared data item between every pair of sources in each 
iteration to detect copying, so they do not scale when the sizes
of the sources, the number of the sources, or the number
of iterations is large. As pointed out in [12], even on a
medium-sized data set (55 sources and 16,000 data items), 
conducting copy detection would slow down data fusion by 
one to two orders of magnitude. In the Big Data environment 
where the data sources are growing rapidly in size and millions 
of sources in the same domain are emerging 0, scalability 
is important for successful copy detection. Consequently, we 
propose and study the problem of how to improve scalability 
of copy detection for structured data. 

Before we describe our techniques, it is instructive to 
consider scalable techniques that have been proposed for 
discovering copying on other types of data, such as text 
documents and software programs (surveyed in 0 ). Each 
document or program can be considered as a text sequence. 
Reuse of sufficiently large text fragments is taken as evidence 
for copying and copy detection essentially looks for such 
fragments. To improve scalability, proposed techniques create 
signatures (or fingerprints) of each text fragment and build 
an inverted index for all or selected signatures. A pair of 
documents or programs are compared only if a sufficiently 
large number of signatures are shared. Directly applying such 
techniques on structured data would be inadequate for two 
reasons. First, different values need to be treated differently 
in copy detection: sharing false values is treated as strong 
evidence for copying whereas sharing true ones is considered 
as a possible coincidence and treated only as weak evidence; 
thus, whether copying is likely depends not only on the number 
of shared values but also on the truthfulness of the shared 
values. Second, there is no natural way to order structured 
data (records and attributes); thus, large text fragments may 
not be shared by sources even if there is copying. 

We propose a comprehensive set of index structures and 
algorithms for this problem and make the following contributions.
First, we design an inverted index, where each entry
corresponds to a shared value for a particular data item and
lists the data sources that provide this value. The presence of
a source in an index entry guarantees its absence in all entries 
that correspond to other values for the same data item. We 
associate each entry with a score, derived from the probability 
of the value being false; the higher the score, the stronger the 
evidence that sharing the entry can serve for detecting copying. 
We process the entries in decreasing order of the scores and
consider a pair of sources only if they share at least some
value with a high score (Section III).

Second, we propose pruning algorithms to further improve 
scalability for copy detection. Since we process index entries
in decreasing order of their scores, we consider strong evidence
early as we scan the index. Once we have accumulated
enough evidence to decide copying or no-copying, we can stop
without considering every shared value (Section IV).

Third, we develop incremental algorithms for iterative copy 
detection. We observe that between consecutive iterations, our 
truth-finding decisions typically change very slightly, and so 
do our decisions on copying. Instead of detecting copying from
scratch in each iteration, we refine our decisions incrementally
from previous iterations (Section V).

Finally, we experimented on a variety of real data sets,
showing high scalability of the proposed techniques and
big performance gains over simple sampling strategies (Section VI).
We show that our algorithms together can speed
up copy detection by two to three orders of magnitude, and 
can detect copying for thousands of sources in seconds even 
on a single server. With our algorithms, copy detection can 
significantly improve truth finding with very little overhead. 

Our index and pruning techniques also shed light on other 
applications that require computing similarity by accumulating 
weighted evidence; for example, in record linkage different 
attributes may have different weights in the computation of 
record similarity. 


II. Preliminaries 

We review copy-detection techniques for structured data and
describe the opportunities to improve scalability. 

A. Copy detection 

Copy detection for structured sources was recently studied
in 0, 0, 0, 0, [15]. They consider a domain $\mathcal{D}$ of data
items, each describing a particular aspect of a real-world
object, such as the capital of a state. They consider a set $\mathcal{S}$ of
data sources, each providing data for a subset of data items
in $\mathcal{D}$; we denote by $D(S)$ the items provided by $S \in \mathcal{S}$.
Schema mapping and entity resolution are assumed to have
been performed so it is known which data items are shared
between the sources (errors in these stages can be treated as
wrongly provided data). A source $S_1$ is considered as a copier
of $S_2$ if $S_1$ copies a subset of data values from $S_2$. Copy
detection aims at finding copying between sources in $\mathcal{S}$.

Bayesian analysis: The key idea in copy detection is to
examine the values shared between a pair of sources.^1 It is

1 Advanced techniques also consider coverage and formatting of
data items a, for which we can extend our techniques.


assumed that each data item is associated with a single true 
value that reflects the real world, but there are in addition 
$n > 1$ false values in the domain for each data item, and
they are uniformly distributed.^2 Thus, the likelihood that two
independent sources share the same false value is typically 
low. As a result, sharing false values on a large number of 
data items serves as strong evidence for copying. 

Based on this intuition, Bayesian analysis is conducted for
copy detection El, 0, 0, 0. Consider two sources $S_1$ and
$S_2$ and let $\Phi$ be the observation on their data. Denote by $S_1 \to S_2$
(or $S_2 \leftarrow S_1$) copying by $S_1$ from $S_2$, and by $S_1 \perp S_2$
no-copying between them.^3 It is assumed that there is no mutual
copying ($S_1$ copies from $S_2$ and $S_2$ copies from $S_1$), so

$$\Pr(S_1 \perp S_2 \mid \Phi) = \frac{\beta \Pr(\Phi \mid S_1 \perp S_2)}{\beta \Pr(\Phi \mid S_1 \perp S_2) + \alpha \Pr(\Phi \mid S_1 \to S_2) + \alpha \Pr(\Phi \mid S_1 \leftarrow S_2)}. \tag{1}$$

Here, $0 < \alpha < .5$ is the a-priori probability of a source
copying from another one and $\beta = 1 - 2\alpha$. Assuming
independence between data items, and denoting by $\Phi_D$ the
observation on data item $D$, we have $\Pr(\Phi \mid S_1 \perp S_2) =
\prod_{D \in \mathcal{D}} \Pr(\Phi_D \mid S_1 \perp S_2)$ (similar for other cases). Thus, we can
rewrite Eq.(1) as follows.


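$$\Pr(S_1 \perp S_2 \mid \Phi) = \frac{1}{1 + \frac{\alpha}{\beta}\left(\prod_{D \in \mathcal{D}} \frac{\Pr(\Phi_D \mid S_1 \to S_2)}{\Pr(\Phi_D \mid S_1 \perp S_2)} + \prod_{D \in \mathcal{D}} \frac{\Pr(\Phi_D \mid S_1 \leftarrow S_2)}{\Pr(\Phi_D \mid S_1 \perp S_2)}\right)}. \tag{2}$$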


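Computation of Eq.(2) would require computing
$\prod_{D \in \mathcal{D}} \frac{\Pr(\Phi_D \mid S_1 \to S_2)}{\Pr(\Phi_D \mid S_1 \perp S_2)}$; we denote its logarithm
by $C_\to = \sum_{D \in \mathcal{D}} \ln \frac{\Pr(\Phi_D \mid S_1 \to S_2)}{\Pr(\Phi_D \mid S_1 \perp S_2)}$ (similarly,
$C_\leftarrow = \sum_{D \in \mathcal{D}} \ln \frac{\Pr(\Phi_D \mid S_1 \leftarrow S_2)}{\Pr(\Phi_D \mid S_1 \perp S_2)}$). Essentially, $C_\to$ and
$C_\leftarrow$ accumulate the contribution from each data item
$D \in \mathcal{D}$. We denote the contribution score from $D$ to $C_\to$
by $C_\to(D) = \ln \frac{\Pr(\Phi_D \mid S_1 \to S_2)}{\Pr(\Phi_D \mid S_1 \perp S_2)}$ and compute it as follows
(similar for $C_\leftarrow$).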

1. Providing the same value v: For each source $S \in \mathcal{S}$, its
accuracy, denoted by $A(S)$, is measured as the fraction of its
true values over all provided values. This can be considered
as the probability of $S$ providing a true value for a data item.
Then, the probability of $S_1$ and $S_2$ independently providing the
(same) true value is $A(S_1)A(S_2)$, and that for the same false
value is $\frac{1 - A(S_1)}{n} \cdot \frac{1 - A(S_2)}{n} \cdot n = \frac{(1 - A(S_1))(1 - A(S_2))}{n}$ (it is
assumed that there are $n$ uniformly distributed false values).
In practice, we are often not sure which value is true. Let
$P(D.v)$ be the probability of value $v$ being true for $D$, then

$$\Pr(\Phi_D \mid S_1 \perp S_2) = P(D.v) A(S_1) A(S_2) + (1 - P(D.v)) \frac{(1 - A(S_1))(1 - A(S_2))}{n}. \tag{3}$$

Now consider copying. Let $s$ be the selectivity of copying;^4
that is, the probability that the copier copies on a particular
item. When $S_1$ copies from $S_2$ on $D$ (with probability $s$), they
must provide the same value. The probability of our observed


2 This assumption can be relaxed to take value distributions into
account [6], but is used here for simplicity.

3 We can also extend our techniques to distinguish direct copying
from co-copying and transitive copying [5] and we skip the details.

4 $\alpha$, $n$, $s$ are inputs and can be set/refined according to 0, 0.
















TABLE I
Motivating example. False values are in italic font (shown here between
asterisks). An empty cell corresponds to a missing value from a source.

      Accu   NJ (D1)      AZ (D2)    NY (D3)      FL (D4)      TX (D5)
S0    0.99   Trenton      Phoenix    Albany                    Austin
S1    0.99   Trenton      Phoenix    Albany       Orlando      Austin
S2    0.2    *Atlantic*   Phoenix    *NewYork*    *Miami*      *Houston*
S3    0.2    *Atlantic*   Phoenix    *NewYork*    *Miami*      *Arlington*
S4    0.4    *Atlantic*   Phoenix    *NewYork*    Orlando      *Houston*
S5    0.6    *Union*      *Tempe*    Albany       Orlando      Austin
S6    0.01                *Tempe*    *Buffalo*    *PalmBay*    *Dallas*
S7    0.25   Trenton                 *Buffalo*    *PalmBay*    *Dallas*
S8    0.2    Trenton      *Tucson*   *Buffalo*    *PalmBay*    *Dallas*
S9    0.99   Trenton                              Orlando      Austin


value then depends on the likelihood that $S_2$ provides the
value. The probability of $S_2$ providing the true value is $A(S_2)$
and that for a false value is $1 - A(S_2)$. Thus, the probability
for our observation of $S_2$'s data on $D$, denoted by $\Phi_D(S_2)$, is

$$\Pr(\Phi_D(S_2)) = P(D.v) A(S_2) + (1 - P(D.v))(1 - A(S_2)). \tag{4}$$

When $S_1$ does not copy from $S_2$ on $D$ (with probability
$1 - s$), the probability that they both (independently) provide
$v$ is the same as $\Pr(\Phi_D \mid S_1 \perp S_2)$. Thus,

$$\Pr(\Phi_D \mid S_1 \to S_2) = (1 - s)\Pr(\Phi_D \mid S_1 \perp S_2) + s\Pr(\Phi_D(S_2)). \tag{5}$$

Combining Eq.(3-5), we have

$$C_\to(D) = \ln\left(1 - s + s \cdot \frac{\Pr(\Phi_D(S_2))}{\Pr(\Phi_D \mid S_1 \perp S_2)}\right). \tag{6}$$

2. Providing different values: When two sources provide
different values, the copier cannot copy (the probability is $1 - s$)
and they independently provide different values. Thus,

$$\Pr(\Phi_D \mid S_1 \to S_2) = (1 - s)\Pr(\Phi_D \mid S_1 \perp S_2); \tag{7}$$

$$C_\to(D) = \ln(1 - s). \tag{8}$$


It has been proved that $C_\to(D)$ is positive when $S_1$ and $S_2$
share the same value on $D$ and negative otherwise, and it is
larger when the shared value has a lower $P(D.v)$ (i.e., $v$ is
more likely to be false) 0. In other words, sharing a value
serves as evidence for copying and vice versa, and sharing a
false value serves as strong evidence for copying.
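
The contribution score of Eq.(3)-(8) can be sketched in a few lines of Python. This is only an illustrative sketch: the function and parameter names (`independent_prob`, `source_prob`, `contribution`, `n_false`) are our own and do not come from the paper; the numbers are those quoted for $S_2$ and $S_3$ on NJ.Atlantic.

```python
import math

def independent_prob(p_true, acc1, acc2, n_false):
    """Eq.(3): both sources independently provide the same value v."""
    return (p_true * acc1 * acc2
            + (1 - p_true) * (1 - acc1) * (1 - acc2) / n_false)

def source_prob(p_true, acc2):
    """Eq.(4): the (potentially copied) source S2 provides v."""
    return p_true * acc2 + (1 - p_true) * (1 - acc2)

def contribution(p_true, acc1, acc2, s, n_false, same_value=True):
    """C->(D) for S1 copying from S2: Eq.(6) if the values agree, Eq.(8) otherwise."""
    if not same_value:
        return math.log(1 - s)
    ratio = source_prob(p_true, acc2) / independent_prob(p_true, acc1, acc2, n_false)
    return math.log(1 - s + s * ratio)

# S2 and S3 (accuracy .2 each) share NJ.Atlantic, assumed true with probability .01.
score = contribution(0.01, 0.2, 0.2, s=0.8, n_false=50)
print(round(score, 2))  # 3.89
```

Sharing the likely-true value AZ.Phoenix (probability .95) under the same settings yields a much smaller score of about 1.6, which illustrates why false values carry most of the evidence.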

Example 2.1: Consider 10 data sources that describe capitals
for 5 states in the US (Table I); their accuracy measures
are shown in the second column. There is copying among
$S_2$ to $S_4$ and among $S_6$ to $S_8$. We set $\alpha = 0.1$, $s = 0.8$, and
$n = 50$.

Consider $S_2$ and $S_3$ as an example. Starting with $D_1$, they
provide the same value so we apply Eq.(6). Suppose that
NJ.Atlantic has probability .01 to be true. Then, $C_\to(D_1) =
C_\leftarrow(D_1) = \ln\left(.2 + .8 \cdot \frac{.01 \cdot .2 + .99 \cdot .8}{.01 \cdot .2^2 + .99 \cdot .8^2 / 50}\right) = 3.89$, showing
that sharing this false value is strong evidence for copying.
We compute for other items similarly and eventually $C_\to =
C_\leftarrow = 3.89 + 1.6 + 3.86 + 3.83 - 1.6 = 11.58$. Applying Eq.(2)
computes $\Pr(S_2 \perp S_3 \mid \Phi) = .00004$, so copying is very likely.

Now consider $S_0$ and $S_1$, which also share 4 values. However,
suppose we know that these values are all true and each
of them has a contribution .01 (details skipped). Eventually,


TABLE II
Iterations for the motivating example.

      Rnd 1   Rnd 2   Rnd 3   Rnd 4   Rnd 5
S0    0.75    0.94    0.96    0.98    0.99
S1    0.98    0.99    0.99    0.99    0.99
S2    0.38    0.23    0.21    0.2     0.2
S3    0.38    0.23    0.21    0.2     0.2
S4    0.58    0.43    0.41    0.4     0.4

(a) Source accuracy.

              Rnd 1   Rnd 2   Rnd 3   Rnd 4   Rnd 5
NJ.Trenton    0.9     0.95    0.96    0.97    0.97
NJ.Atlantic   0.07    0.03    0.02    0.01    0.01
AZ.Phoenix    0.94    0.95    0.95    0.95    0.95
NY.Albany     0.07    0.77    0.88    0.92    0.94
NY.NewYork    0.84    0.16    0.08    0.03    0.02
FL.Orlando    0.9     0.92    0.92    0.92    0.92
FL.Miami      0.05    0.03    0.04    0.03    0.03
TX.Austin     0.9     0.93    0.95    0.96    0.96
TX.Houston    0.04    0.03    0.02    0.02    0.02

(b) Probability of values in the index.


$C_\to = C_\leftarrow = .01 \cdot 4 = .04$ and $\Pr(S_0 \perp S_1 \mid \Phi) = .79$, so
copying is unlikely. □
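
As a quick check, the two posteriors quoted in Example 2.1 can be reproduced from the accumulated scores with Eq.(2). The sketch below is illustrative only; the function name `no_copy_probability` is our own.

```python
import math

def no_copy_probability(c_fwd, c_bwd, alpha):
    """Eq.(2): Pr(S1 independent of S2 | observation) from scores C-> and C<-."""
    beta = 1 - 2 * alpha
    return 1.0 / (1.0 + (alpha / beta) * (math.exp(c_fwd) + math.exp(c_bwd)))

p_s2_s3 = no_copy_probability(11.58, 11.58, alpha=0.1)  # S2 and S3
p_s0_s1 = no_copy_probability(0.04, 0.04, alpha=0.1)    # S0 and S1
print(f"{p_s2_s3:.5f} {p_s0_s1:.2f}")  # 0.00004 0.79
```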

Iterative computation: We often do not know value probability
$P(D.v)$ and source accuracy $A(S)$, and computing them
often requires knowledge of the copying relationship (details
in 0). An iterative approach has been proposed as follows 0,
0: starting by assuming the same accuracy for each source,
each round iteratively computes copying probability, value
truthfulness, and source accuracy, until convergence. For our
motivating example, there are five rounds before convergence.
Table II shows the source accuracy and value probability
computed in each round (for simplicity, we show only for
the first 5 sources and their values).


B. Opportunities for scalability improvement 

Previous works 0, 0 conduct copy detection in an exhaustive
fashion. For each pair of sources, the algorithm, called
PAIRWISE, does the following: (1) compute for each shared
$D \in \mathcal{D}$ the contribution scores $C_\to(D)$ and $C_\leftarrow(D)$; (2)
accumulate the scores and compute $C_\to$ and $C_\leftarrow$; (3) apply
Eq.(2) for probability computation. This process is repeated
in every round. If the results converge in $l$ rounds, the time
complexity is $O(l|\mathcal{D}||\mathcal{S}|^2)$. PAIRWISE is not scalable if the
number of sources or data items is large, or there are many 
iterations. The algorithm proposed in El also examines every 
pair of sources, so has similar complexity to PAIRWISE. 
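
A single round of PAIRWISE can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout (dictionaries keyed by source and item), the function names, and the restriction to four sources of Table I are our own assumptions; value probabilities are taken from the final round of the motivating example.

```python
import math
from itertools import combinations

def contribution(p_true, acc_copier, acc_source, s, n_false):
    """Eq.(6): score of a shared value when the copier copies from the source."""
    indep = (p_true * acc_copier * acc_source
             + (1 - p_true) * (1 - acc_copier) * (1 - acc_source) / n_false)
    src = p_true * acc_source + (1 - p_true) * (1 - acc_source)
    return math.log(1 - s + s * src / indep)

def pairwise(data, accuracy, value_prob, alpha=0.1, s=0.8, n_false=50):
    """Return Pr(no-copying) for every pair, examining every shared item."""
    beta = 1 - 2 * alpha
    result = {}
    for a, b in combinations(sorted(data), 2):
        c_fwd = c_bwd = 0.0
        for item in data[a].keys() & data[b].keys():
            if data[a][item] == data[b][item]:
                p = value_prob[(item, data[a][item])]
                c_fwd += contribution(p, accuracy[a], accuracy[b], s, n_false)
                c_bwd += contribution(p, accuracy[b], accuracy[a], s, n_false)
            else:  # Eq.(8): different values on a shared item
                c_fwd += math.log(1 - s)
                c_bwd += math.log(1 - s)
        result[(a, b)] = 1 / (1 + (alpha / beta) * (math.exp(c_fwd) + math.exp(c_bwd)))
    return result

DATA = {
    "S0": {"NJ": "Trenton", "AZ": "Phoenix", "NY": "Albany", "TX": "Austin"},
    "S1": {"NJ": "Trenton", "AZ": "Phoenix", "NY": "Albany", "FL": "Orlando", "TX": "Austin"},
    "S2": {"NJ": "Atlantic", "AZ": "Phoenix", "NY": "NewYork", "FL": "Miami", "TX": "Houston"},
    "S3": {"NJ": "Atlantic", "AZ": "Phoenix", "NY": "NewYork", "FL": "Miami", "TX": "Arlington"},
}
ACCURACY = {"S0": 0.99, "S1": 0.99, "S2": 0.2, "S3": 0.2}
VALUE_PROB = {("NJ", "Trenton"): 0.97, ("NJ", "Atlantic"): 0.01, ("AZ", "Phoenix"): 0.95,
              ("NY", "Albany"): 0.94, ("NY", "NewYork"): 0.02, ("FL", "Miami"): 0.03,
              ("TX", "Austin"): 0.96}

probs = pairwise(DATA, ACCURACY, VALUE_PROB)
for pair, p in probs.items():
    print(pair, round(p, 5))
```

The nested loops make the cost per round proportional to the number of source pairs times the number of shared items, which is the behavior the rest of the paper sets out to avoid.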

There are several opportunities for improving scalability of 
copy detection. First, for some pairs of sources that share no 
value at all or just a few true values, we can determine that they 
are independent without going through all of the shared data 
items; this can reduce the number of source pairs we examine. 
For our motivating example, PAIRWISE requires examining 45
pairs of sources; however, among them 18 pairs (such as $S_0$
and $S_6$) do not share any value, and the pair of $S_0$ and $S_5$
shares only two true values. We can skip these pairs. Section III
describes how we can explore this opportunity by building and
using a specialized inverted index.


























Second, for some pairs of sources that share a lot of false 
values, we can determine copying after we observe only a 
subset of these false values; this can reduce the number of data 
items we examine for those pairs. In our motivating example,
$S_2$ and $S_3$ share 4 values, including 3 false ones; actually, after
observing 2 false values, we can already determine copying
without knowing the rest of the provided values. Section IV
explores this opportunity for single-round copy detection.

Third, in the iterative process, the changes in value probability
and source accuracy between two consecutive rounds
after the second round are typically very small. Thus, we can
do copy detection incrementally to consider fewer data items
for each pair of sources in later rounds. Section V explores
this opportunity and describes incremental copy detection.

Ideally, these optimizations should tremendously
reduce computation and thus execution time, while
leading to the same (binary) decision on copying relationships,
and also on value truthfulness. In practice, however, early
pruning may improve efficiency with a slight loss of accuracy.
We show in experiments (Section VI) the effectiveness and
scalability of our techniques.

We have also explored Fagin's NRA (No Random Access)
algorithm [10] for top-k search to speed up copy detection.
We maintain for each value of a data item a list of contribution
scores for the pairs of sources that share the value, and order
the pairs in decreasing score order. We also maintain a list
containing the accumulated contribution scores from different
values for pairs of sources that have such differences. Then,
$C_\to$ (similar for $C_\leftarrow$) for a particular pair of sources is the sum
of the scores from all lists. To find copying, we can apply NRA
to find the pairs with top values of $C_\to$ and $C_\leftarrow$ and stop when
$C_\to$ and $C_\leftarrow$ lead to the conclusion of no-copying. However,
we show in experiments (Section VI) that even generating the
input to NRA (i.e., the ordered lists) for our problem is slower
than our proposed approaches.

III. Inverted Index 

We first describe an important building block in our
solution, the inverted index, which facilitates the exploration
of many aforementioned opportunities for scalability improvement.
Inverted indexes were originally used in Information Retrieval M,
and we describe the adaptation for copy detection.

Building the index: An important component in copy detection
is to find for each pair of sources the values, not just
the items, they share. We can facilitate this process with an
inverted index, where each entry corresponds to a value $v$ for
a data item $D$, denoted by $D.v$, and contains the sources that
provide $v$ on $D$. Note that the presence of source $S$ in the
entry for $D.v$ guarantees that $S$ is not present in any of the
entries for $D.v'$, $v' \neq v$.

Intuitively, we wish to first consider sharing of values
that serve as strong evidence for copying, as it provides the
opportunity to prune weak evidence for copying. We order
the entries according to their contribution scores to $C_\to$ and
$C_\leftarrow$. Note, however, that according to Eq.(3-6) the contribution
from sharing $D.v$ can be different for different pairs of sources
with various accuracy; we choose the maximum one, denoted

TABLE III
Inverted index for the motivating example. The two sources
used to compute the contribution scores are in bold.

Value         Pr     Score   Providers
AZ.Tempe      0.02   4.59    S5, S6
NJ.Atlantic   0.01   4.12    S2, S3, S4
TX.Houston    0.02   4.05    S2, S4
NY.NewYork    0.02   4.05    S2, S3, S4
TX.Dallas     0.02   3.98    S6, S7, S8
NY.Buffalo    0.04   3.97    S6, S7, S8
FL.PalmBay    0.05   3.97    S6, S7, S8
FL.Miami      0.03   3.83    S2, S3
AZ.Phoenix    0.95   1.62    S0, S1, S2, S3, S4
NJ.Trenton    0.97   1.51    S0, S1, S7, S8, S9
FL.Orlando    0.92   0.84    S1, S4, S5, S9
NY.Albany     0.94   0.43    S0, S1, S5
TX.Austin     0.96   0.43    S0, S1, S5, S9


by $M(D.v)$. The next proposition shows that we can compute
$M(D.v)$ only from providers (i.e., sources) with the maximum
or minimum accuracy (proofs omitted to save space).

Proposition 3.1: Let $D.v$ be a value with probability
$P(D.v)$. Let $A_{min}$ be the minimum accuracy among $D.v$'s
providers.

• If $A_{min} \le \frac{1}{1 + \frac{nP(D.v)}{1 - P(D.v)}}$, $M(D.v)$ is obtained by Eq.(6)
when $S_1$ has the maximum accuracy and $S_2$ has the
minimum accuracy;

• If $A_{min} > \frac{1}{1 + \frac{nP(D.v)}{1 - P(D.v)}}$ and $P(D.v) < .5$, $M(D.v)$ is
obtained by Eq.(6) when $S_2$ has the minimum accuracy
and $S_1$ has the second minimum accuracy;

• Else, $M(D.v)$ is obtained when $S_1$ has the minimum
accuracy and $S_2$ has the second minimum accuracy. □

We can now formally define our specialized inverted index.


Definition 3.2 (Inverted Index): Let $\mathcal{D}$ be a set of data items
and $\mathcal{S}$ be a set of sources. The inverted index for $\mathcal{D}$ and $\mathcal{S}$
contains a set $\mathcal{E}$ of entries, such that for each $E \in \mathcal{E}$,

1) $E$ corresponds to a value $D_E.v_E$, where $D_E \in \mathcal{D}$ and
$v_E$ is a value provided by at least two sources on $D_E$;

2) $E$ is associated with probability $P(E)$ for $D_E.v_E$ being
true and with contribution score $C(E) = M(D_E.v_E)$;

3) the entry contains a set $S(E)$ of sources that provide
$D_E.v_E$. □


Example 3.3: Continue with Example 2.1. Table III shows the
inverted index for the data, assuming knowledge of value
probability. As an example, entry NJ.Atlantic has probability
0.01 and contribution score 4.12, computed from pair ($S_4$, $S_3$),
with the highest and lowest accuracy among providers of
NJ.Atlantic. Note that there is no entry for value NJ.Union,
AZ.Tucson, or TX.Arlington, as each of them is provided by
a single source. Also note that for any entries for the same
data item, such as NJ.Atlantic and NJ.Trenton, there is no
overlap between their sources.
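
To make the construction concrete, the following sketch builds the index for the data of Table I. It is an illustration under stated assumptions: the names (`ROWS`, `PROB`, `build_index`) are hypothetical, value probabilities are copied from Table III, and $M(D.v)$ is found by brute force over all ordered provider pairs rather than through Proposition 3.1.

```python
import math
from collections import defaultdict
from itertools import permutations

ITEMS = ("NJ", "AZ", "NY", "FL", "TX")
ROWS = {  # Table I; None marks a missing value
    "S0": ("Trenton", "Phoenix", "Albany", None, "Austin"),
    "S1": ("Trenton", "Phoenix", "Albany", "Orlando", "Austin"),
    "S2": ("Atlantic", "Phoenix", "NewYork", "Miami", "Houston"),
    "S3": ("Atlantic", "Phoenix", "NewYork", "Miami", "Arlington"),
    "S4": ("Atlantic", "Phoenix", "NewYork", "Orlando", "Houston"),
    "S5": ("Union", "Tempe", "Albany", "Orlando", "Austin"),
    "S6": (None, "Tempe", "Buffalo", "PalmBay", "Dallas"),
    "S7": ("Trenton", None, "Buffalo", "PalmBay", "Dallas"),
    "S8": ("Trenton", "Tucson", "Buffalo", "PalmBay", "Dallas"),
    "S9": ("Trenton", None, None, "Orlando", "Austin"),
}
ACC = dict(zip(ROWS, (0.99, 0.99, 0.2, 0.2, 0.4, 0.6, 0.01, 0.25, 0.2, 0.99)))
PROB = {"Tempe": 0.02, "Atlantic": 0.01, "Houston": 0.02, "NewYork": 0.02,
        "Dallas": 0.02, "Buffalo": 0.04, "PalmBay": 0.05, "Miami": 0.03,
        "Phoenix": 0.95, "Trenton": 0.97, "Orlando": 0.92, "Albany": 0.94,
        "Austin": 0.96}

def contribution(p, acc_copier, acc_source, s=0.8, n=50):
    """Eq.(6) for a shared value with truth probability p."""
    indep = p * acc_copier * acc_source + (1 - p) * (1 - acc_copier) * (1 - acc_source) / n
    return math.log(1 - s + s * (p * acc_source + (1 - p) * (1 - acc_source)) / indep)

def build_index():
    """One entry per value with at least two providers, in decreasing score order."""
    providers = defaultdict(list)
    for src, row in ROWS.items():
        for item, value in zip(ITEMS, row):
            if value is not None:
                providers[(item, value)].append(src)
    index = []
    for (item, value), srcs in providers.items():
        if len(srcs) < 2:
            continue  # values from a single source are never shared
        p = PROB[value]
        # Brute-force maximum over ordered (copier, source) pairs of providers.
        score = max(contribution(p, ACC[a], ACC[b]) for a, b in permutations(srcs, 2))
        index.append({"entry": f"{item}.{value}", "prob": p,
                      "score": score, "providers": srcs})
    return sorted(index, key=lambda e: e["score"], reverse=True)

index = build_index()
for e in index[:2]:
    print(e["entry"], round(e["score"], 2), e["providers"])
```

Because the probabilities in Table III are rounded, scores recomputed this way may differ from the printed ones in the second decimal place for some entries.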

The following properties show that processing the entries 
in decreasing order of their contribution scores not only helps 
quickly accumulate strong evidence for copying, but also helps 
compute the upper bound of the contribution scores, making it 
amenable to additional optimizations. We also show in experiments
(Section VI) that this processing order significantly
improves over random ordering.

















Proposition 3.4: For each pair of sources $S_1, S_2 \in \mathcal{S}$
and index entry $E \in \mathcal{E}$, the following properties hold for
$C_{\to/\leftarrow}(D_E)$.

• If $S_1, S_2 \in S(E)$, $C_\to(D_E)$ is computed based on
$P(D_E.v_E)$.

• If $S_1 \in S(E)$, $S_2 \notin S(E)$, but they share item $D_E$, they
provide different values on $D_E$ and $C_\to(D_E) = \ln(1 - s)$.

• If neither $S_1$ nor $S_2$ has appeared in any entry for $D_E$
before entry $E$, $C_\to(D_E) \le C(E)$. □

Optimizing with the index: With the inverted index, we 
can improve copy detection in three ways. First, copying is 
unlikely if two sources do not share any value; thus, we can 
skip source pairs that do not appear in the same entry. 

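Second, copying is also unlikely if two sources share only
a few true values and we can skip them too. To simplify
the computation, we consider the entries with the lowest
contribution scores and denote by $\bar{\mathcal{E}} \subseteq \mathcal{E}$ the subset of
entries where $\sum_{E \in \bar{\mathcal{E}}} C(E) < \ln\frac{\beta}{2\alpha}$. Then, for source pairs
that do not share any value outside $\bar{\mathcal{E}}$, $C_\to < \ln\frac{\beta}{2\alpha}$ and
$C_\leftarrow < \ln\frac{\beta}{2\alpha}$, so $\Pr(S_1 \perp S_2 \mid \Phi) > \frac{1}{1 + \frac{\alpha}{\beta}\left(\frac{\beta}{2\alpha} + \frac{\beta}{2\alpha}\right)} = .5$ and
copying is unlikely. Thus, we consider a pair of sources only
if they appear together in some entry outside $\bar{\mathcal{E}}$.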
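
Selecting the low-score subset amounts to taking the longest tail of the score-ordered index whose scores sum to less than $\ln\frac{\beta}{2\alpha}$. A minimal sketch follows, with a hypothetical function name and the scores of Table III as input:

```python
import math

def low_score_tail(scores_desc, alpha):
    """Number of trailing (lowest-score) entries whose scores sum below ln(beta / 2 alpha)."""
    beta = 1 - 2 * alpha
    limit = math.log(beta / (2 * alpha))
    total, count = 0.0, 0
    for score in reversed(scores_desc):
        if total + score >= limit:
            break
        total += score
        count += 1
    return count

# Contribution scores of Table III, in decreasing order.
scores = [4.59, 4.12, 4.05, 4.05, 3.98, 3.97, 3.97, 3.83, 1.62, 1.51, 0.84, 0.43, 0.43]
print(low_score_tail(scores, alpha=0.1))  # 2
```

With $\alpha = 0.1$ the limit is $\ln 4 \approx 1.39$, so only the last two entries (NY.Albany and TX.Austin) fall into the subset.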

Third, since each data item for which the two sources
provide different values contributes the same negative score
$\ln(1 - s)$ (Eq.(8)), the accumulated score from these items
depends only on the number of these items. This number
can be derived from (1) the number of shared items, denoted
by $l(S_1, S_2)$, counted at index building time (we can apply
techniques for set similarity joins to improve efficiency of
counting), and (2) the number of shared values, denoted by
$n(S_1, S_2)$, counted at index scanning time.

We next describe an algorithm, INDEX, that uses the inverted
index for copy detection. Instead of considering each
pair of sources, INDEX scans the inverted index in decreasing
order of contribution scores and proceeds in three steps.

1) For each entry $E \in \mathcal{E} \setminus \bar{\mathcal{E}}$ and each pair of sources
$S_1, S_2 \in S(E)$, (1) compute the contribution from $E$
and update $C_\to$ and $C_\leftarrow$ for $(S_1, S_2)$, and (2) maintain
$n(S_1, S_2)$.

2) For each entry $E \in \bar{\mathcal{E}}$, do the same as in Step 1 but
only for pairs encountered before.

3) After scanning the whole index, for each already considered
pair $(S_1, S_2)$, (1) update scores for data items
where different values are provided by adding
$\ln(1 - s)(l(S_1, S_2) - n(S_1, S_2))$, and (2) compute copy probability
accordingly.
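
The three steps can be sketched as follows on the data of Tables I and III. This is an illustrative sketch, not the authors' code: the data layout and names (`ROWS`, `ENTRIES`, `index_scan`) are assumptions, entries are keyed by value name only because value names happen to be unique across items in this example, and the entry order and probabilities are copied from Table III.

```python
import math
from collections import defaultdict
from itertools import combinations

ROWS = {  # Table I columns: NJ, AZ, NY, FL, TX (None = missing)
    "S0": ("Trenton", "Phoenix", "Albany", None, "Austin"),
    "S1": ("Trenton", "Phoenix", "Albany", "Orlando", "Austin"),
    "S2": ("Atlantic", "Phoenix", "NewYork", "Miami", "Houston"),
    "S3": ("Atlantic", "Phoenix", "NewYork", "Miami", "Arlington"),
    "S4": ("Atlantic", "Phoenix", "NewYork", "Orlando", "Houston"),
    "S5": ("Union", "Tempe", "Albany", "Orlando", "Austin"),
    "S6": (None, "Tempe", "Buffalo", "PalmBay", "Dallas"),
    "S7": ("Trenton", None, "Buffalo", "PalmBay", "Dallas"),
    "S8": ("Trenton", "Tucson", "Buffalo", "PalmBay", "Dallas"),
    "S9": ("Trenton", None, None, "Orlando", "Austin"),
}
ACC = dict(zip(ROWS, (0.99, 0.99, 0.2, 0.2, 0.4, 0.6, 0.01, 0.25, 0.2, 0.99)))
# Table III: value -> (probability, score), listed in decreasing score order.
ENTRIES = {
    "Tempe": (0.02, 4.59), "Atlantic": (0.01, 4.12), "Houston": (0.02, 4.05),
    "NewYork": (0.02, 4.05), "Dallas": (0.02, 3.98), "Buffalo": (0.04, 3.97),
    "PalmBay": (0.05, 3.97), "Miami": (0.03, 3.83), "Phoenix": (0.95, 1.62),
    "Trenton": (0.97, 1.51), "Orlando": (0.92, 0.84), "Albany": (0.94, 0.43),
    "Austin": (0.96, 0.43),
}

def contribution(p, acc_copier, acc_source, s, n=50):
    """Eq.(6) for a shared value with truth probability p."""
    indep = p * acc_copier * acc_source + (1 - p) * (1 - acc_copier) * (1 - acc_source) / n
    return math.log(1 - s + s * (p * acc_source + (1 - p) * (1 - acc_source)) / indep)

def index_scan(alpha=0.1, s=0.8):
    beta = 1 - 2 * alpha
    providers = defaultdict(list)
    for src, row in ROWS.items():
        for value in row:
            if value in ENTRIES:
                providers[value].append(src)
    # l(S1, S2): number of shared items, counted at index-building time.
    shared_items = {(a, b): sum(x is not None and y is not None
                                for x, y in zip(ROWS[a], ROWS[b]))
                    for a, b in combinations(ROWS, 2)}
    # Low-score subset: trailing entries whose scores sum below ln(beta / 2 alpha).
    limit, total, low = math.log(beta / (2 * alpha)), 0.0, set()
    for value, (_, score) in reversed(list(ENTRIES.items())):
        if total + score >= limit:
            break
        total += score
        low.add(value)
    scores, n_shared, examined = {}, defaultdict(int), 0
    for value, (p, _) in ENTRIES.items():  # Steps 1 and 2, in decreasing score order
        for pair in combinations(providers[value], 2):
            if pair not in scores:
                if value in low:
                    continue  # Step 2: only pairs encountered before
                scores[pair] = [0.0, 0.0]
            a, b = pair
            scores[pair][0] += contribution(p, ACC[a], ACC[b], s)
            scores[pair][1] += contribution(p, ACC[b], ACC[a], s)
            n_shared[pair] += 1
            examined += 1
    result = {}
    for pair, (c_fwd, c_bwd) in scores.items():  # Step 3
        penalty = math.log(1 - s) * (shared_items[pair] - n_shared[pair])
        result[pair] = 1 / (1 + (alpha / beta)
                            * (math.exp(c_fwd + penalty) + math.exp(c_bwd + penalty)))
    return result, examined

no_copy, examined = index_scan()
print(len(no_copy), examined)  # 26 51
```

On this data the sketch maintains scores for 26 source pairs and examines 51 shared values.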

Proposition 3.5: Let $r$ be the number of source pairs for
which we maintain scores. INDEX takes time $O(r \cdot |\mathcal{D}|)$ and
space $O(r)$, obtaining the same binary results as PAIRWISE.
Note that index building has a much lower complexity:
$O(|\mathcal{S}||\mathcal{D}|)$. □

The next example shows that the INDEX algorithm can
considerably improve the efficiency of copy detection.

Example 3.6: Continue with the motivating example. For
INDEX, the last two entries in the index (Table III) form the
set $\bar{\mathcal{E}}$ ($.43 + .43 < \ln\frac{\beta}{2\alpha} = 1.39$). There are only 26 pairs of
sources that occur in entries outside $\bar{\mathcal{E}}$; for example, $S_0$ and $S_5$
share only values in $\bar{\mathcal{E}}$, so we do not need to consider this pair.
In total, INDEX needs to examine 51 shared values and performs
$51 \cdot 2 + 26 \cdot 2 = 154$ computations (2 additional computations
for each pair of sources on different values) for copy detection.
Note that pairwise detection requires examining 45 pairs of
sources and 183 shared data items, thus conducting
$183 \cdot 2 = 366$ computations in total. For this example, INDEX cuts
computation by more than half. □


IV. Detection in One Round 


INDEX does not need to consider every pair of sources
and thus can save computation; however, for each pair it
considers, it still examines all shared values. The properties
of the inverted index (Proposition 3.4) make it possible to
terminate after we examine only a subset of shared values
for a pair. First, when we observe a lot of high-score (low-probability)
entries to which both sources belong, we may
conclude with copying early. Second, when we observe a lot
of entries to which one of the two sources belongs and a lot
of high-score entries to which neither source belongs, we may
conclude with no-copying early. This section describes how
we can speed up copy detection by making early decisions.


A. Reducing examined shared values 

Given a pair of sources $S_1$ and $S_2$, as we scan the index,
we can maintain for $C_\to$ a maximum and a minimum
score, denoted by $C_\to^{max}$ and $C_\to^{min}$ respectively; similarly we
maintain $C_\leftarrow^{max}$ and $C_\leftarrow^{min}$. If the minimum scores are large
enough to conclude copying, or the maximum scores are small
enough to conclude no-copying, we can terminate early. For
such pruning, we need to (1) decide the termination conditions
and (2) compute maximum and minimum scores.

Termination conditions: We first consider binary decisions
for copying. According to Eq.(2), to guarantee
$\Pr(S_1 \perp S_2 \mid \Phi) > .5$ (no-copying), we should have
$1 > \frac{\alpha}{\beta}(e^{C_\to} + e^{C_\leftarrow})$; this must be true if $C_\to^{max} < \ln\frac{\beta}{2\alpha}$ and
$C_\leftarrow^{max} < \ln\frac{\beta}{2\alpha}$. Thus, we define threshold $\theta_{ind} = \ln\frac{\beta}{2\alpha}$ for
no-copying. On the other hand, to guarantee $\Pr(S_1 \perp S_2 \mid \Phi) < .5$
(copying), we should have $1 < \frac{\alpha}{\beta}(e^{C_\to} + e^{C_\leftarrow})$; this must
be true if $C_\to^{min} > \ln\frac{\beta}{\alpha}$ or $C_\leftarrow^{min} > \ln\frac{\beta}{\alpha}$. Thus, we define
threshold $\theta_{cp} = \ln\frac{\beta}{\alpha}$ for copying. If none of the conditions
is satisfied after we scan the whole index, we apply Eq.(2) to
compute the probability of copying.
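
The two thresholds and the resulting early decision can be expressed compactly. The sketch below is illustrative; `thresholds` and `early_decision` are hypothetical names, and the bounds passed in are assumed to have been computed elsewhere.

```python
import math

def thresholds(alpha):
    """Return (theta_ind, theta_cp) for a-priori copying probability alpha."""
    beta = 1 - 2 * alpha
    theta_ind = math.log(beta / (2 * alpha))  # both maxima below this: no-copying
    theta_cp = math.log(beta / alpha)         # either minimum above this: copying
    return theta_ind, theta_cp

def early_decision(c_min_fwd, c_max_fwd, c_min_bwd, c_max_bwd, alpha):
    """Binary decision if the bounds already settle it, otherwise None."""
    theta_ind, theta_cp = thresholds(alpha)
    if c_min_fwd > theta_cp or c_min_bwd > theta_cp:
        return "copying"
    if c_max_fwd < theta_ind and c_max_bwd < theta_ind:
        return "no-copying"
    return None

theta_ind, theta_cp = thresholds(0.1)
print(round(theta_ind, 2), round(theta_cp, 2))  # 1.39 2.08
```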

If we instead wish to compute real copying probabilities
when they are within [.1, .9] (or some other values close to 0
or 1), we can consider three different cases: $\Pr(S_1 \perp S_2 \mid \Phi) >
.9$, $\Pr(S_1 \perp S_2 \mid \Phi) < .1$, and otherwise; we can compute the
thresholds accordingly.

Maximum/Minimum score computation: When we scan
each entry $E$, we update $C_\to^{min}$ and $C_\to^{max}$ for every pair
$S_1, S_2 \in S(E)$ as follows (similar for $C_\leftarrow$). First, $C_\to^{min}$
is obtained when the two sources share only the observed
common values and no other value. Let $C_\to^0(S_1, S_2)$ be the
sum of scores from observed common values. The score
for each of the remaining items is negative, $\ln(1 - s)$. Let
$n_0(S_1, S_2)$ be the number of observed shared values and recall
that $l(S_1, S_2)$ denotes the number of shared items. Then,

$$C_\to^{min}(S_1, S_2) = C_\to^0(S_1, S_2) + (l(S_1, S_2) - n_0(S_1, S_2)) \ln(1 - s). \tag{9}$$

For $C^{max}_{\rightarrow}$, we need to consider the already scanned entries
containing only $S_1$ or $S_2$, or neither of them. We thus compute
scores for three subsets of data items and $C^{max}_{\rightarrow}$ is the sum of
their scores.

• Data items with observed shared values: The accumulated
score from such items is $C^{0}_{\rightarrow}(S_1, S_2)$.

• Data items with observed non-shared values: According
to Proposition 3.4, for each data item $D$ shared between
$S_1$ and $S_2$, if we have seen only $S_1$ or $S_2$ appearing in
one of its entries, the contribution score for $D$ is negative,
$\ln(1 - s)$. However, finding the precise number of such
items requires recording the set of observed entries for
each source and can cost a lot of space; thus, we estimate
the minimum number from the numbers of observed
values for $S_1$ and for $S_2$, denoted by $n(S_1)$ and $n(S_2)$
respectively. Let $N_1$ be the overlapping items among the
$n(S_1)$ items scanned for $S_1$; the size of $N_1$ is roughly
$n(S_1) \cdot \frac{l(S_1, S_2)}{|D(S_1)|}$; similar for $S_2$, denoted by $N_2$. The set
of overlapping items among all scanned items is $N_1 \cup N_2$,
and its size satisfies
$|N_1 \cup N_2| \ge \max\{|N_1|, |N_2|\} = \max\left\{n(S_1) \cdot \frac{l(S_1, S_2)}{|D(S_1)|},\; n(S_2) \cdot \frac{l(S_1, S_2)}{|D(S_2)|}\right\}$,
denoted by $h$.
Thus, the two sources provide different values for at least
$h - n_0(S_1, S_2)$ data items, and the maximum score is
$(h - n_0(S_1, S_2)) \ln(1 - s)$.

• Data items we have not seen for $S_1$ or $S_2$: There are
at most $l(S_1, S_2) - h$ such items, and according to
Proposition 3.4, the maximum score for each of them is
the score of the next unscanned entry, denoted by $M$. The
maximum score for this subset is thus $(l(S_1, S_2) - h) \cdot M$.

In summary, we have

$$C^{max}_{\rightarrow}(S_1, S_2) = C^{0}_{\rightarrow}(S_1, S_2) + \big(h - n_0(S_1, S_2)\big) \ln(1 - s) + \big(l(S_1, S_2) - h\big) \cdot M. \qquad (10)$$
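The two bounds can be written down directly from Eqs. (9) and (10). The following is a minimal sketch, not the authors' implementation: the function and parameter names are hypothetical, and the item counts passed to `estimate_h` in the usage note are made up for illustration.

```python
def estimate_h(n1, n2, shared_items, items_s1, items_s2):
    """Estimated lower bound h on the scanned overlapping items.

    n1, n2: numbers of observed values n(S1), n(S2)
    shared_items: l(S1, S2); items_s1, items_s2: |D(S1)|, |D(S2)|
    """
    return max(n1 * shared_items / items_s1, n2 * shared_items / items_s2)


def c_min(c0, shared_items, n0, ln_one_minus_s):
    """Eq. (9): every item without an observed shared value is penalized."""
    return c0 + (shared_items - n0) * ln_one_minus_s


def c_max(c0, shared_items, n0, h, next_score, ln_one_minus_s):
    """Eq. (10): observed shared values, observed differences, unseen items."""
    observed_different = (h - n0) * ln_one_minus_s
    unseen = (shared_items - h) * next_score
    return c0 + observed_different + unseen
```

With the figures used for the pair $(S_2, S_3)$ in Example 4.2 ($C^{0} = 3.89$, $l = 5$, $n_0 = h = 1$, $M = 4.05$, $\ln(1 - s) = -1.6$), `c_min` evaluates to about -2.51 and `c_max` to about 20.09.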


Algorithm and analysis: From the previous analysis, we
design algorithm Bound, which proceeds in four steps.

Step I. Build the inverted index and initialize $l(S_1, S_2)$ for
each pair of sources that occur together in at least one entry
of the index. Initialize the active set of source pairs as $Q = \emptyset$.

Step II. As we scan each entry $E \in \mathcal{E} \setminus \bar{E}$, do the following.

1) For each $S \in S(E)$, if it is observed for the first time,
set $n(S) = 1$; otherwise, increase $n(S)$ by 1.

2) For each pair $S_1, S_2 \in S(E)$ that we observe for the
first time, set $n_0(S_1, S_2) = C^{0}_{\rightarrow/\leftarrow}(S_1, S_2) = 0$ and add
it to $Q$.

3) For each pair $(S_1, S_2) \in Q$ with $S_1, S_2 \in S(E)$, do the following.

(1) Increase $n_0(S_1, S_2)$ by 1 and update $C^{0}_{\rightarrow/\leftarrow}(S_1, S_2)$.

(2) Compute $C^{min}_{\rightarrow/\leftarrow}(S_1, S_2)$. If either is above $\theta_{cp}$,
conclude copying and remove the pair from $Q$.

(3) Compute $C^{max}_{\rightarrow/\leftarrow}(S_1, S_2)$. If both are below $\theta_{ind}$,
conclude no-copying and remove the pair from $Q$.

Step III. As we scan each entry $E \in \bar{E}$, do 1 and 3 in Step
II (but only for pairs encountered before).

Step IV. After we scan the index, for each pair $(S_1, S_2) \in Q$,
we have $n_0(S_1, S_2) = n(S_1, S_2)$, so $C_{\rightarrow} = C^{min}_{\rightarrow}$ (similar
for $C_{\leftarrow}$). If both $C_{\rightarrow}$ and $C_{\leftarrow}$ are below $\theta_{ind}$, conclude
no-copying; otherwise, apply Eq. (2).

Proposition 4.1: Let $r$ be the number of source pairs that
share values, and $e$ be the maximum number of shared entries
we process for each pair before concluding. BOUND takes time
$O(r \cdot e)$ and space $O(r + |S|)$. □

As Bound estimates the number of observed overlapping
data items ($h$) in computing $C^{max}_{\rightarrow/\leftarrow}$, the result may be different
from pairwise detection. However, the computation of $h$ and
the use of $M$ in Eq. (10) make the upper bounds already
loose, so the decisions are rarely different, as we observed
in our experiments (Section VI). Note that $e$ can be much
smaller than $|D|$, so BOUND often significantly reduces the
total number of data items we consider in copy detection.
However, computing upper and lower bounds of contribution
scores introduces an overhead, so Bound may not always
save computation for each pair of sources, as illustrated next.

Example 4.2: Continue with Ex. 3.6. We have $\theta_{cp} = 2.08$
and $\theta_{ind} = 1.39$.

First consider pair $(S_2, S_3)$; recall that they share 4 values
(including 3 false ones) and copying is likely. We see them
first at entry NJ.Atlantic, where $C^{0}_{\rightarrow/\leftarrow}(S_2, S_3) = 3.89$. By
Eq. (9) we compute $C^{min}_{\rightarrow/\leftarrow} = 3.89 - 1.6 * (5 - 1) = -2.51$.
For maximum scores, $h = 1$, so by Eq. (10) $C^{max}_{\rightarrow/\leftarrow} = 3.89 +
0 + 4.05 * (5 - 1) = 20.09$. We see this pair again at entry
NY.NewYork. We update $C^{0}_{\rightarrow/\leftarrow}(S_2, S_3) = 3.89 + 3.86 = 7.75$,
so $C^{min}_{\rightarrow/\leftarrow} = 7.75 - 1.6 * (5 - 2) = 2.95 > 2.08 = \theta_{cp}$ and we can
conclude copying for the pair. While Index considers 4 shared
values for them and conducts 4 * 2 + 2 = 10 computations,
Bound considers only 2 shared values and conducts 4 + 1 = 5
computations.

Now consider pair $(S_0, S_1)$; recall that they share 4 true
values and no-copying is likely. When we see them at the third
shared entry NY.Albany, we have $C^{0}_{\rightarrow/\leftarrow}(S_0, S_1) = .01 * 3 =
.03$, so $C^{max}_{\rightarrow/\leftarrow} = .03 + 0 + 0.43 * (4 - 3) = .46 < 1.39 = \theta_{ind}$;
we can then conclude no-copying. Thus, Bound considers 3
shared values and conducts 4 * 3 = 12 computations. However,
Index considers 4 shared values for them but conducts only
4 * 2 + 2 = 10 computations, fewer than Bound.

In total BOUND considers 26 pairs, 33 shared values, and
requires 116 computations. It considers 18 fewer shared values
and conducts 38 fewer computations than Index. □
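The early-termination logic of Steps II to IV for a single source pair can be sketched as follows. This is an illustrative simplification rather than the paper's algorithm: it tracks one direction only (in Example 4.2 both directions have equal scores), it assumes no non-shared values have been observed so that $h = n_0$, and the function name and input layout are hypothetical.

```python
def bound_decide(steps, shared_items, ln_one_minus_s, theta_cp, theta_ind):
    """Scan the shared entries of one pair and stop as soon as a bound decides.

    steps: list of (score, next_score) in index order, where `score` is the
           contribution of the shared entry just scanned and `next_score`
           is M, the score of the next unscanned entry.
    Returns (decision, number of shared values examined).
    """
    c0 = 0.0  # accumulated score of observed shared values
    n0 = 0    # number of observed shared values
    for score, next_score in steps:
        n0 += 1
        c0 += score
        h = n0  # assumption: no differing values observed so far
        c_min = c0 + (shared_items - n0) * ln_one_minus_s
        if c_min > theta_cp:
            return "copying", n0
        c_max = c0 + (h - n0) * ln_one_minus_s + (shared_items - h) * next_score
        if c_max < theta_ind:
            return "no-copying", n0
    return "undecided", n0
```

Feeding in the scores of $(S_2, S_3)$ from Example 4.2 (3.89 and then 3.86, with $l = 5$) stops with copying after two shared values. For $(S_0, S_1)$, three scores of .01 with $M = .43$ after the third stop with no-copying after three shared values; the $M$ values for the first two steps are not given in the example and have to be supplied as assumptions.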


B. Reducing computation 

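Although BOUND reduces the number of shared values we
consider, it introduces the overhead of computing $C^{min}_{\rightarrow/\leftarrow}$ and
$C^{max}_{\rightarrow/\leftarrow}$. Actually, we do not need to maintain $C^{min}_{\rightarrow/\leftarrow}$ and
$C^{max}_{\rightarrow/\leftarrow}$ each time we scan an entry; we only need to
do so when termination is likely. This can further reduce
computation for BOUND.

First, suppose after scanning entry $E$, we compute $C^{min}_{\rightarrow} <
\theta_{cp}$ and $C^{min}_{\leftarrow} < \theta_{cp}$ for source pair $(S_1, S_2)$. The next shared
value can increase $C^{min}_{\rightarrow}$ and $C^{min}_{\leftarrow}$ by at most $M - \ln(1 - s)$
(recall that $M$ denotes the contribution score of the next
entry). Thus, we do not need to re-compute $C^{min}_{\rightarrow/\leftarrow}$ until we
have observed at least

$$T^{min} = \left\lceil \frac{\theta_{cp} - \max\{C^{min}_{\rightarrow}, C^{min}_{\leftarrow}\}}{M - \ln(1 - s)} \right\rceil$$

shared values.

Similarly, suppose after we scan $E$, we compute $C^{max}_{\rightarrow} >
\theta_{ind}$ or $C^{max}_{\leftarrow} > \theta_{ind}$ for sources $(S_1, S_2)$. A new data
item on which $S_1$ and $S_2$ provide different values would
reduce $C^{max}_{\rightarrow}$ and $C^{max}_{\leftarrow}$ by $M - \ln(1 - s)$, so we would
resume computing maximum scores when we see

$$T^{max}_{0} = \left\lceil \frac{\max\{C^{max}_{\rightarrow}, C^{max}_{\leftarrow}\} - \theta_{ind}}{M - \ln(1 - s)} \right\rceil$$

more different values. At entry $E$
we have already seen $h - n_0(S_1, S_2)$ different values, so
we need to see in total $T^{max}_{0} + h - n_0(S_1, S_2)$ different
values. We do not re-compute $C^{max}_{\rightarrow/\leftarrow}$ until

$$n(S_1) > T^{max}_{1} = \left\lceil \frac{|D(S_1)|}{l(S_1, S_2)} \big(T^{max}_{0} + h - n_0(S_1, S_2)\big) \right\rceil \quad \text{or} \quad n(S_2) > T^{max}_{2} = \left\lceil \frac{|D(S_2)|}{l(S_1, S_2)} \big(T^{max}_{0} + h - n_0(S_1, S_2)\big) \right\rceil.$$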

Example 4.3: Consider a pair of sources $S_1$ and $S_2$ that
share 101 data items. Suppose again that $\ln(1 - s) =
-1.6$, $\theta_{cp} = 2.08$, $\theta_{ind} = 1.39$. Suppose the first shared
item we have observed between $S_1$ and $S_2$ has contribution
$C^{0}_{\rightarrow/\leftarrow} = 5$. Then $C^{min}_{\rightarrow/\leftarrow} = 5 - (101 - 1) * 1.6 = -155 < 2.08$.
Suppose $M = 4$; then $T^{min} = \left\lceil \frac{2.08 - (-155)}{4 + 1.6} \right\rceil = 29$, so we do
not compute $C^{min}_{\rightarrow/\leftarrow}$ until we have observed 29 other shared
values.

For maximum scores, suppose we have not seen other
entries containing $S_1$ or $S_2$ yet, so $h = 1$ and $C^{max}_{\rightarrow/\leftarrow} = 5 + 0 +
(101 - 1) * 4 = 405$. Then, $T^{max}_{0} = 72$. Suppose
$\frac{l(S_1, S_2)}{|D(S_1)|} = \frac{l(S_1, S_2)}{|D(S_2)|} = .8$. Then $T^{max}_{1} = T^{max}_{2} = \left\lceil \frac{72 + 1 - 1}{.8} \right\rceil =
90$. So we do not compute $C^{max}_{\rightarrow/\leftarrow}$ until $n(S_1) > 90$ or
$n(S_2) > 90$. □
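The two timers used in this example can be sketched in a few lines. This is a hedged illustration with hypothetical function names; it covers the minimum-score timer and the conversion of a required number of different values into a per-source count, taking the value of 72 as given by the example. The coverage ratio is passed as an exact fraction so that the ceiling is not affected by floating-point round-off.

```python
import math
from fractions import Fraction


def t_min(theta_cp, c_min_fwd, c_min_bwd, next_score, ln_one_minus_s):
    """Shared values to wait for before the minimum scores can reach theta_cp."""
    gap = theta_cp - max(c_min_fwd, c_min_bwd)
    return math.ceil(gap / (next_score - ln_one_minus_s))


def t_max_per_source(t_max_0, h, n0, coverage):
    """Observed-value count for one source before maximum scores are recomputed.

    coverage: l(S1, S2) / |D(S)| for that source, ideally as a Fraction.
    """
    needed_different = t_max_0 + h - n0
    return math.ceil(needed_different / Fraction(coverage))


print(t_min(2.08, -155, -155, 4, -1.6))               # 29
print(t_max_per_source(72, 1, 1, Fraction(4, 5)))     # 90
```

Skipping bound computation until these counters are reached is what distinguishes the refined variant from recomputing both bounds at every shared entry.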

We can improve BOUND accordingly and the result is 
called BOUND+. Note that BOUND+ has the same asymptotic 
complexity as BOUND but in practice can save a lot of 
computation. 

Finally, only when two sources share many data items are
we likely to reduce the number of considered shared values
significantly enough to compensate for the extra cost of
bound computation. We can thus apply Index for pairs of
sources that share only a few data items and apply BOUND+
for the rest of the pairs. We call the resulting algorithm
Hybrid. Our experiments (Section VI) show that Hybrid
can further reduce computation and copy-detection time.


V. Incremental Detection 

Section IV considered copy detection in a single round;
this section considers the iterative process. Our observation
is that although value probability and source accuracy can
change from round to round, after the second
round the changes are typically small and seldom change
our copy-detection decisions. A natural way to improve
scalability is to detect copying incrementally after the second
round. We base our discussions on the Hybrid algorithm.


A. Overview 

Both changes in value probability $P(D.v)$ and changes in
source accuracy $A(S)$ can affect copy detection. We distinguish
big changes and small changes. If a pair of sources
contains a source with a big accuracy change, we need to
recompute the probability of copying. For the rest of the
pairs, we can incrementally update the contribution scores. We
update scores on big-change entries first; only for pairs whose
score changes can lead to an opposite decision on copying
would we further consider small-change entries. The challenge is
to reduce the number of entries we consider whenever possible
but still reach the same copying decision.

Fig. 1. Entry categories. The steps for no-copying pairs are in italic font.

We denote by $P_{old}(D.v)$ the probability used previously in
score contribution. Note that the recorded probability may not
be the one from the previous round, but can be from some
earlier round when we did the last re-computation. For an
entry $D.v$, we consider the change on $M(D.v)$ rather than
on $P(D.v)$, since a small change of the latter may cause a
big change of the former, which eventually matters in score
computation. To separate the change of value probability from
that of source accuracy, we compute $M(D.v)$ on the same
two accuracies used in the round with $P_{old}(D.v)$. Finally, to
classify big and small changes, we have a threshold $\rho$. We can
either set a default one, or order the changes in decreasing
order, choose the maximum gap between two consecutive
changes, and set $\rho$ to the change above the gap.
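The gap-based choice of the threshold can be sketched as below. This is an illustrative reading of the rule, with assumptions stated loudly: the function name is hypothetical, changes are compared by magnitude, and ties between equal gaps are broken in favor of the first (largest-change) gap.

```python
def rho_from_largest_gap(changes):
    """Return the change value sitting just above the largest gap.

    changes: score changes of index entries (sign is ignored here, which
             is an assumption; the text only says "order the changes").
    """
    ordered = sorted((abs(c) for c in changes), reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two changes to find a gap")
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    return ordered[widest]


# Score changes listed in the ΔScore column of Table IV.
deltas = [0.17, 0.21, 0.39, -2.49, -0.12, 1.69, -0.01, -0.01, -0.01]
rho = rho_from_largest_gap(deltas)
big = [d for d in deltas if abs(d) >= rho]
```

On these values the largest gap lies between 1.69 and .39, so the two NY entries end up as big changes; any threshold above .39 and no larger than 1.69 yields the same split.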


B. Source pairs with copying 

We first explain our strategy for source pairs where we
concluded with copying in the previous round. Recall that for
a pair of sources, we may make our decision before reaching
the end of the index; we call the last entry we considered
the decision point. Accordingly, we can categorize the shared
entries between this pair of sources into five categories (see
Fig. 1): (1) $E_{\downarrow}$: big-change entries whose contribution scores
decrease (the probabilities of the entries increase) before the
decision point; (2) $E_{\searrow}$: small-change entries whose scores
decrease before the decision point; (3) $E_{\uparrow}$: big-change entries
whose scores increase before the decision point; (4) $E_{\nearrow}$:
small-change entries whose scores increase before the decision
point; and (5) $E_{\boxtimes}$: shared entries after the decision point.

Among them, entries in $E_{\downarrow}$ and $E_{\searrow}$ would decrease the
scores and may even change our decision. The high-level idea
for our algorithm is to first consider the decreases, and then
compensate for score loss with the increases from the other
categories until the scores are once again above the threshold
$\theta_{cp}$. In the latter process, we consider big increases (from $E_{\uparrow}$)
first, and small ones (from $E_{\nearrow}$) last. We next describe how
we update $C_{\rightarrow}$ (resp. $C_{\leftarrow}$).
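The five categories can be expressed as a small classification routine. The sketch below is hypothetical in its names and in two details the text leaves open: a change counts as big when its magnitude is at least the threshold, and a zero change is filed under small increases.

```python
from enum import Enum


class Category(Enum):
    BIG_DECREASE = "big decrease before decision point"
    SMALL_DECREASE = "small decrease before decision point"
    BIG_INCREASE = "big increase before decision point"
    SMALL_INCREASE = "small increase before decision point"
    AFTER_DECISION = "after decision point"


def categorize(delta_score, position, decision_point, rho):
    """Classify one shared entry of a source pair.

    delta_score: new contribution score minus old contribution score
    position, decision_point: positions in the index scan order
    rho: threshold separating big from small changes
    """
    if position > decision_point:
        return Category.AFTER_DECISION
    is_big = abs(delta_score) >= rho  # assumption: boundary counts as big
    if delta_score < 0:
        return Category.BIG_DECREASE if is_big else Category.SMALL_DECREASE
    return Category.BIG_INCREASE if is_big else Category.SMALL_INCREASE
```

For instance, with a threshold of 1, a change of -2.49 before the decision point is a big decrease and a change of .17 is a small increase, while any entry positioned after the decision point falls into the fifth category regardless of its change.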

Preparation step: As a preparation, in the round when we
conduct copy detection from scratch, we maintain for each
pair of sources the number of shared values before the decision
point and that after the decision point (the latter can be denoted
by $|E_{\boxtimes}|$). We then compute the final score of this round,
denoted by $\hat{C}_{\rightarrow}$ (resp. $\hat{C}_{\leftarrow}$), as $\hat{C}_{\rightarrow} = C^{min}_{\rightarrow} - |E_{\boxtimes}| \cdot \ln(1 - s)$.
Note that the computation of $C^{min}_{\rightarrow}$ assumes that there is
no value shared after the decision point and applies penalty
$\ln(1 - s)$ to each such shared value; the computation of $\hat{C}_{\rightarrow}$
removes this penalty but does not apply their real (positive)
contribution; thus, $C^{min}_{\rightarrow} \le \hat{C}_{\rightarrow} \le C_{\rightarrow}$. We use $\hat{C}_{\rightarrow}$ and $\hat{C}_{\leftarrow}$
as the starting scores for the next round.

Step 1 ($E_{\downarrow} \cup E_{\searrow}$, score decreases before the decision point):
Each entry $E \in E_{\downarrow}$ may significantly
reduce $\hat{C}_{\rightarrow}$. We update $\hat{C}_{\rightarrow}$ by replacing the old score on $E$,
computed by the old value probability and source accuracy,
with the new one computed by the new value probability and
source accuracy. Each entry $E \in E_{\searrow}$ will reduce $\hat{C}_{\rightarrow}$ slightly.
Instead of updating the change for each such entry, we use the
maximum change, denoted by $\Delta_{\rho}$, which we estimate from the
entry with the largest score decrease below $\rho$. We decrease $\hat{C}_{\rightarrow}$
by $\Delta_1 = \Delta_{\rho} \cdot |E_{\searrow}|$. If after these changes $\max\{\hat{C}_{\rightarrow}, \hat{C}_{\leftarrow}\} >
\theta_{cp}$ still holds, we can stop; otherwise, we conduct Steps 2-5
and stop once $\max\{\hat{C}_{\rightarrow}, \hat{C}_{\leftarrow}\} > \theta_{cp}$.

Step 2 ($E_{\boxtimes}$, entries after the decision point): In case the new
score is below $\theta_{cp}$, we look for
data entries that can increase it back to above the threshold.
Consider the shared entries after the decision point. Each of
them should have a minimum contribution score, which can
be estimated on the last entry in the index. We denote this
score by $m$ and increase $\hat{C}_{\rightarrow}$ by $\Delta_2 = m \cdot |E_{\boxtimes}|$.

Step 3 ($E_{\uparrow}$, big increases): Each entry $E \in E_{\uparrow}$ can significantly increase
$\hat{C}_{\rightarrow}$ and compensate for the loss. We update $\hat{C}_{\rightarrow}$ by replacing
the old score on $E$ with the new one.

Step 4 ($E_{\boxtimes}$): Each $E \in E_{\boxtimes}$ may also increase $\hat{C}_{\rightarrow}$ a lot.
We (1) increase $\hat{C}_{\rightarrow}$ by the contribution score of $E$, and (2) subtract $m$ from
$\hat{C}_{\rightarrow}$ and $\Delta_2$ to remove our previous estimation on $E$.

Step 5 ($E_{\searrow} \cup E_{\nearrow}$, small changes): Now the only entries not updated are
those with small changes before the decision point. For each
$E \in E_{\searrow} \cup E_{\nearrow}$, we (1) update $\hat{C}_{\rightarrow}$ by replacing the old
score on $E$ with the new one, and (2) if $E \in E_{\searrow}$, increase
$\hat{C}_{\rightarrow}$ by $\Delta_{\rho}$ and subtract $\Delta_{\rho}$ from $\Delta_1$ to remove our previous
estimation on $E$.

Final step: We remove the estimation, recording $\hat{C}_{\rightarrow} + \Delta_1 -
\Delta_2$ as the precise $\hat{C}_{\rightarrow}$ (resp. $\hat{C}_{\leftarrow}$) for the starting point of the
next round. We also update the new decision point if needed.
If the condition $\max\{\hat{C}_{\rightarrow}, \hat{C}_{\leftarrow}\} > \theta_{cp}$ is not satisfied until
the end, we apply Bayesian analysis to decide if we need to
change our decision to no-copying.

These steps can be combined into three passes of index
scanning. The first pass conducts Steps 1 and 2, the second
pass conducts Steps 3 and 4, and the third pass conducts Step
5. Figure 1 summarizes the algorithm. We next illustrate the
idea using an example.

Example 5.1: Recall that for the motivating example there
are five rounds before convergence (Table II). Consider incremental
detection at Round 3; it considers value probabilities
from Round 1 as old ones and those from Round 2 as new
ones. Table IV shows the inverted index. We set $\rho = 1$,
so there are 2 entries with big score changes,
corresponding to the values for NY (NY.Albany and NY.NewYork).

TABLE IV
Partial inverted index used in Round 3. Only sources S0-S4 and their provided values are shown.

Value        Providers    Pr           ΔScore  Cat.  Score @R3
TX.Houston   S?, S4       .04 → .03      .17    ↗     3.97
FL.Miami     S0, S3       .05 → .03      .21    ↗     3.83
NJ.Atlantic  S2, S3, S4   .07 → .03      .39    ↗     3.96
NY.Albany    S0, S1       .07 → .77    -2.49    ↓      .52
NJ.Trenton   S0, S1       .9 → .95      -.12    ↘     1.31
NY.NewYork   S2, S3, S4   .84 → .16     1.69    ↑     3.17
AZ.Phoenix   S0-S4        .94 → .95     -.01    ↘     1.45
FL.Orlando   S1, S4       .9 → .92      -.01    ↘      .78
TX.Austin    S0, S1       .9 → .93      -.01    ↘      .51

First consider $(S_2, S_3)$. In Round 2 it terminates at entry
NY.NewYork, having scores of $C^{min}_{\rightarrow/\leftarrow} = 4.73$, sharing 3
values before the decision point and 1 value after the point.
Thus, $\hat{C}_{\rightarrow/\leftarrow} = 4.73 + 1.6 = 6.33$. Among the shared entries
before decision, 2 have small increases and 1 has a big increase.
Thus, the score is not decreased and we can terminate for this
pair without further examination.

Now consider $(S_0, S_1)$. In Round 2 it terminates at the last
entry, having scores $C_{\rightarrow} = 1.15$, $C_{\leftarrow} = 1.66$, and sharing
4 values before the decision point; recall that $\theta_{cp} = 2.08$
and $\theta_{ind} = 1.39$, so we need to apply Eq. (2), computing
$\Pr(S_0 \perp S_1 \mid \Phi) = .32$ and deciding copying. Among the 4
shared values, NY.Albany has a big score decrease and the other
three have small decreases. The contribution scores for $C_{\rightarrow/\leftarrow}$
from NY.Albany were .44/1.53 in Round 2 and are .24/.09
now. The largest score difference for the other three items
is .015, computed from NJ.Trenton, which has the largest
score decrease among small-change entries. Accordingly, we
have $\hat{C}_{\rightarrow} = 1.15 + (.24 - .44) - .015 * 3 = .9 < \theta_{cp}$,
and $\hat{C}_{\leftarrow} = 1.66 + (.09 - 1.53) - .015 * 3 = .17 < \theta_{cp}$;
thus, we may change our decision. Since $E_{\uparrow} = E_{\boxtimes} = \emptyset$,
we cannot compensate for the loss of the score. We next
reconsider the items with small changes and compute precise
scores $\hat{C}_{\rightarrow} = .95 < \theta_{ind}$, $\hat{C}_{\leftarrow} = .20 < \theta_{ind}$. Therefore, we
change our decision for this pair to no-copying. □
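The arithmetic of this example follows from the preparation step and Step 1. The sketch below uses hypothetical function names and only restates those two computations; it is not the full incremental algorithm.

```python
def starting_score(c_min, entries_after_decision, ln_one_minus_s):
    """Preparation step: remove the penalty charged for entries after the decision point."""
    return c_min - entries_after_decision * ln_one_minus_s


def step1_update(score, big_decreases, small_decrease_count, max_small_change):
    """Step 1: exact update for big decreases, estimated update for small ones.

    big_decreases: list of (old_contribution, new_contribution) pairs
    small_decrease_count: number of small-decrease entries before the decision point
    max_small_change: largest score decrease among small-change entries
    """
    for old, new in big_decreases:
        score += new - old
    return score - max_small_change * small_decrease_count


first_pair = starting_score(4.73, 1, -1.6)                 # about 6.33
forward = step1_update(1.15, [(0.44, 0.24)], 3, 0.015)     # about 0.905
backward = step1_update(1.66, [(1.53, 0.09)], 3, 0.015)    # about 0.175
```

Both updated scores for the second pair fall below the copying threshold of 2.08, which is why the example continues to the later steps and finally re-examines the small-change entries exactly.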


C. Source pairs with no-copying 

We handle source pairs with no-copying in a similar way.
For such pairs, entries in $E_{\uparrow}$ and $E_{\nearrow}$ would increase the
scores and may change our decision, while entries in $E_{\downarrow}$ and
$E_{\searrow}$ would decrease the scores and compensate for the score
increase, so we change the order of considering them. Also,
Step 2 does not apply for no-copying pairs since we actually
need to reduce the scores to compensate for their increase. Again,
the steps are summarized in Figure 1. In addition, we compute
$\hat{C}_{\rightarrow/\leftarrow}$ by Eq. (10) with two changes. First, we use the real
number of different values obtained from bookkeeping rather
than the estimated one. Second, in case the maximum score
$M$ has a big change, we update $\hat{C}_{\rightarrow/\leftarrow}$ upfront in each round.

Example 5.2: Continue with Ex. 5.1 and now consider
no-copying pair $(S_0, S_2)$. In Round 2 it terminates at entry
AZ.Phoenix, having scores $C^{max}_{\rightarrow} = -4.75$, $C^{max}_{\leftarrow} = -4.3$,
sharing 1 value before decision and 0 values after decision. The
shared value is in category $E_{\searrow}$, so Step 1 does not change
the score and we can terminate with the same decision. □

TABLE V
Overview of data sets.

             #Srcs   #Items    #Dist-values   #Index-entries
Book-CS        894     2,528         14,930            7,398
Stock-1day      55    16,000        104,611           40,834
Book-full    3,182   147,431        162,961           48,683
Stock-2wk       55   160,000        915,118          405,537
The final algorithm, INCREMENTAL, updates scores for all
source pairs in three passes of index scanning. It requires
more space for book-keeping across rounds, but in practice
it recomputes scores for far fewer entries.

Proposition 5.3: Let $r$ be the number of source pairs that
share values, and $e'$ be the maximum number of shared entries
we process for each pair. INCREMENTAL takes time $O(r e')$
and space $O(|\mathcal{E}| + r + |S|)$ for a single round. □

Example 5.4: In Rounds 3-5 for our example, BOUND+
takes 102 computations for each round, while INCREMENTAL
reduces it to 54, 29 and 0 respectively. The total number of
computations for INCREMENTAL is 73% lower than that for
BOUND+. □


VI. Experimental Results 

This section presents experimental results validating the
efficiency and effectiveness of the inverted index and algorithms
proposed in this paper. We show that among the strategies we
have proposed, the inverted index can improve the efficiency
by one to two orders of magnitude and obtain exactly the
same results; pruning and incremental detection together can
improve the efficiency by nearly one order of magnitude and
obtain very similar results; and careful sampling can improve
the efficiency by orders of magnitude with only a small loss in
the quality of the results.

A. Experiment settings 

Data: We experimented on four data sets⁵; Table V provides
an overview. Two data sets were crawled from an online
bookstore aggregator, AbeBooks.com: Book-CS contains 894
sources (i.e., book stores), 1265 CS books, and 2528 data
items including the title and author list of each book (there
are missing values for some books); on average 5.9 conflicting
values are provided for each data item. Book-full contains
3182 sources, 81,352 books of all categories, and 147,431
data items; on average 1.1 conflicting values are provided for
each data item. A gold standard for Book-CS contains author
lists verified from book title pages for 100 randomly selected
books.

The other two data sets were crawled from 55 Deep Web
sources on 16 attributes of 1000 stocks. Stock-1day includes
the data on 7/7/2011 and Stock-2wk includes the data from
7/1/2011 to 7/14/2011. The former contains 16 * 1000 =
16,000 data items and on average 6.5 conflicting values are
provided for each data item; the latter contains 16,000 * 10 =
160,000 data items and on average 5.7 conflicting values
are provided for each item. A gold standard for Stock-1day
contains the voting results on the 100 NASDAQ symbols and
100 other randomly selected symbols from 5 popular financial
websites: NASDAQ, Yahoo! Finance, Google Finance, MSN
Money, and Bloomberg.

⁵ The data are at http://lunadong.com/fusionDataSets.htm.

The four data sets have very different features. Book-full
and Stock-2wk contain a large number of data items. Book-CS
and Book-full contain a large number of data sources; however,
some sources contain only a few data items (e.g., 85% of the
sources in Book-CS each cover at most 1% of the books). Stock-1day and
Stock-2wk contain far fewer sources, but each source has
a much higher coverage (e.g., 80% of the sources each cover over
half of the data items).

Implementation: We implemented various methods for copy
detection and describe them as follows.

• Pairwise examines each pair of sources as described at
the beginning of Section III.

• Sample1 randomly samples 1% of data items on Stock-2wk
and 10% on the other data sets, then applies PAIRWISE
on the sampled data.

• SAMPLE2 is different from SAMPLE1 on the two Book
data sets. It considers each data set as a table where each
row represents a source and each column represents a data
item. It randomly samples data items (columns) until the
number of non-empty cells reaches 65% on Book-CS and
24% on Book-full (we explain the need for such sampling
rates shortly).

• Index implements algorithm Index (Section III).

• BOUND and Bound+ each applies the corresponding
algorithm (Section IV) for each round.

• Hybrid applies Index for a pair of sources that share
at most 16 data items⁶ and applies BOUND+ for other
pairs in each round (end of Section IV).

• Incremental applies Hybrid in the first two rounds
and Algorithm INCREMENTAL (Section V) in later
rounds.⁷ It sets $\rho$ to .2 for source accuracy and to 1.0 for
value probability according to observations of the largest
gaps on differences of changes.

• ScaleSample applies Incremental on a sampled
data set, where we sample 1% of data items on Stock-2wk
and 10% on the other data sets, and guarantee sampling
at least N = 4 data items from each source.

• FaginInput generates the input to Fagin's NRA algorithm
as described at the end of Section III-B.

In addition, we used the truth-finding algorithm in E), 
which considers both copying and source accuracy. We 
plugged in the aforementioned copy-detection algorithms. 

We implemented the algorithms in Java on a Windows 
machine with Intel Core i5 processor (3.2GHz, 4MB cache, 
4.8 GT/s QPI, 8GB memory). 

Measures: We measure three aspects of different methods. 


⁶ We observe empirically that when two sources share fewer than
16 data items, INDEX conducts fewer computations than BOUND+
on average.

⁷ Empirically we found that copy-detection and truth-finding results
vary a lot in the first two rounds in general, so applying INCREMENTAL
in the second round would not save much.














TABLE VI
Copy-detection and truth-discovery quality of various algorithms. Except fusion accuracy, all measures are computed by comparing with results of Pairwise. Sample2 obtains the same results as Sample1 on Stock data.

Book-CS
              Copy detection           Truth discovery
Method        Prec    Rec     F-msr    Accu    Fusion diff   Accu var
Pairwise      -       -       -        .890    -             -
Sample1       .691    .165    .264     .870    .070          .127
Sample2       .886    .696    .779     .880    .029          .089
Index         1       1       1        .890    0             0
Hybrid        .990    .980    .985     .890    .015          .039
Incremental   .985    .975    .980     .890    .015          .037
ScaleSample   .930    .841    .882     .890    .029          .055

Stock-1day
              Copy detection           Truth discovery
Method        Prec    Rec     F-msr    Accu    Fusion diff   Accu var
Pairwise      -       -       -        .897    -             -
Sample1       .967    .945    .956     .896    .008          .001
Sample2       .967    .945    .956     .896    .008          .001
Index         1       1       1        .897    0             0
Hybrid        1       .970    .985     .897    .002          .001
Incremental   .993    .947    .969     .897    .003          .001
ScaleSample   .970    .927    .948     .897    .008          .001


TABLE VII
Execution time and the time improvement compared with the previous method (Sample1, Sample2, Index comparing with Pairwise; others comparing with the method in the above row). Sample2 obtains the same results as Sample1 on Stock data.

                    Book-CS             Stock-1day          Book-full           Stock-2wk
Method              Time (s)  Improv.   Time (s)  Improv.   Time (s)  Improv.   Time (s)  Improv.
Pairwise            321       -         306       -         11536     -         3408      -
Sample1             3.2       99%       16.2      95%       278       98%       55        98%
Sample2             32        90%       16.2      95%       684       94%       55        98%
Index               1.6       99.5%     25.0      92%       47.7      99.6%     573       83%
Hybrid              1.2       24%       15.8      37%       47.2      2%        443       23%
Incremental         0.4       65%       6.9       56%       7.9       83%       127       72%
ScaleSample         0.3       25%       0.7       90%       3.8       52%       1.4       99%
Total Improvement             99.91%              99.8%               99.97%              99.96%


Efficiency: We measure efficiency by (1) the number of
computations in copy detection (as described in the examples
in Sections III and IV), and (2) the execution time.

Copy-detection correctness: We examined how the various
methods for improving scalability may hurt the results of copy
detection; thus, we compared their results with those of PAIRWISE.
Precision measures, among the output copying pairs,
what fraction is also output by PAIRWISE; Recall measures,
among the copying pairs output by PAIRWISE, what fraction
is output by the specific method; F-measure is computed by
$\frac{2 \cdot precision \cdot recall}{precision + recall}$.
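These three measures are straightforward to compute from two sets of copying pairs. The sketch below is an illustration with hypothetical names; it treats a copying pair as unordered and returns 0 for an undefined ratio, both of which are assumptions rather than details stated in the text.

```python
def copy_detection_quality(method_pairs, pairwise_pairs):
    """Precision, recall and F-measure of a method against Pairwise.

    Both arguments are iterables of 2-tuples of source names; a pair is
    treated as unordered (assumption).
    """
    method = {frozenset(p) for p in method_pairs}
    reference = {frozenset(p) for p in pairwise_pairs}
    common = method & reference
    precision = len(common) / len(method) if method else 0.0
    recall = len(common) / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up output, for illustration only.
p, r, f = copy_detection_quality(
    [("S2", "S3"), ("S3", "S4"), ("S0", "S1")],
    [("S3", "S2"), ("S3", "S4"), ("S2", "S4"), ("S4", "S5")],
)
```

Here two of the three reported pairs are confirmed by the reference, so precision is 2/3, recall is 1/2, and the F-measure is 4/7.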

Truth-finding correctness: We also examined how the
copy-detection results may affect truth finding. We report three
measures: (1) Fusion accuracy measures the fraction of correct
truth-finding results among all data items in the gold standard;
(2) Fusion difference measures the fraction of truth-finding
results different from those when applying PAIRWISE; and
(3) Accuracy variance measures the average difference of the
source accuracies we compute when applying PAIRWISE and
the specific copy-detection method.

We report efficiency on all data sets and other results only
on the two small data sets Book-CS and Stock-1day.

B. Performance overview 

We first compare the various methods on each data set.
Table VI reports copy-detection and truth-finding correctness,
and Table VII reports execution time.

First, naive sampling (SAMPLE1 and SAMPLE2) did improve
the efficiency a lot, but not as much as INCREMENTAL
and SCALESAMPLE. Indeed, on the Stock data sets they are
one order of magnitude slower than SCALESAMPLE and on the
Book data sets they are even slower than Index. In addition,
Sample1 obtains a very low F-measure on copy detection for
Book-CS, where a lot of data sources provide only a few books,
so random sampling can lead to inaccurate decisions.

Second, our proposed methods for improving scalability
work very well. Without sampling, INCREMENTAL finished
in about 2 minutes for Stock-2wk and in seconds for the other
data sets. In particular, the use of the inverted index by
itself (Index) on average reduced execution time by 94%
and obtained exactly the same results for copy detection and
truth discovery as PAIRWISE. It works especially well for the
two Book data sets (improving by two orders of magnitude)
because a lot of source pairs (95.6% on average) do not share
any data item and need not be considered at all. Also,
we observe from Table V that on average only 42% of the values
are provided by multiple sources and so are indexed. Pruning
(Hybrid) on average reduced execution time further by 21%
and changed copy-detection and truth-discovery results very
slightly. Incremental detection (INCREMENTAL) on average
reduced execution time further by 69% and also changed
the results very slightly. The two enhancements together
reduced execution time by 77% on average and sacrificed
precision and recall of copy detection by at most 5%; they
also changed results of truth discovery very slightly, by up to
1.5%. We observed from our experiments that indexing costs
57% of the execution time in INCREMENTAL, but it takes only
.9% of the execution time of PAIRWISE and significantly improves
scalability, so it is worthwhile.

Third, sampling helps with a small sacrifice on effectiveness:
ScaleSample finished within a few seconds for all data
sets with reasonable F-measure for copy detection and very
similar results for truth discovery. On the Stock data sets, the
improvement corresponds to the sampling rate: 90% for
Stock-1day (sampling rate .1) and 99% for Stock-2wk (sampling rate
.01); in addition, the F-measure and fusion results are very
similar to INCREMENTAL, which does not do sampling. On the
Book data sets, the efficiency was improved but not as much
(by 25% and 52% respectively), and the F-measure of copy
detection drops. Recall that in these two data sets there are a
lot of low-coverage sources, making sampling much harder.
Indeed, we ended up sampling 49% of the data items for Book-CS
and 19% of the items for Book-full. However, we obtain a much
higher F-measure than Sample1 and SAMPLE2, showing the
effectiveness of sampling at least N = 4 data items. Last,
we note that sampling in itself has a very small overhead for
small data sets (5% of execution time on average) but a larger
overhead for large data sets (37% on average); this is because
checking whether each source covers N sampled data items
takes longer for large data sets.

Finally, Table X shows the execution time ratio of our methods
versus FaginInput. FaginInput has two drawbacks.
First, it has to compute the contribution scores from each
shared value for each source pair; thus, Hybrid is 18% faster
than FaginInput on average for a single round. Second, it is
not clear how to generate the input lists incrementally in later
rounds; thus, Incremental is 75% faster than FaginInput
on average for all rounds.


TABLE VIII
Execution time ratio of Incremental vs Hybrid, percentage of pairs terminated at each pass of incremental detection.

          Book-CS   Stock-1day   Book-full   Stock-2wk
Round 3   14.0%     6.9%         3.1%        7.3%
Round 4   12.2%     6.8%         3.3%        4.7%
Round 5   10.2%     6.1%         3.4%        4.4%
Round 6   9.6%      6.4%         3.3%        4.9%
Round 7   10.2%     -            3.7%        -
Round 8   9.6%      -            3.1%        -
Round 9   -         -            3.0%        -
Pass 1    99%       98%          86%         99%
Pass 2    0         1%           4%          0
Pass 3    1%        1%           10%         1%


TABLE IX
Comparing different sampling methods.

              Book-CS               Stock-1day
Method        Prec   Rec    F-msr   Prec   Rec    F-msr
ScaleSample   .92    .84    .88     .98    .94    .96
ByItem        .85    .56    .67     .98    .94    .96
ByCell        .89    .70    .78     .98    .94    .96


TABLE X
Execution-time ratio w.r.t. FaginInput.

              Book-CS   Stock-1day   Book-full   Stock-2wk
Hybrid        .87       .76          .99         .67
Incremental   .30       .27          .22         .19


C. Single-round algorithms 

We next examine single-round algorithms in more detail.
We first compare Index, Bound, Bound+, and Hybrid
on their numbers of computations (for all rounds together)
and copy-detection time (see Figure 2). We have three
observations. First, for three out of four data sets BOUND
conducts more computations and takes longer to finish than
Index. Although it reduces the number of data items for
consideration, it introduces a big overhead for computing the
minimum and maximum scores. Second, BOUND+ speeds up
copy detection significantly: on average it reduces the number
of computations by 55% and saves copy-detection time by
37% over Bound. Third, Hybrid further saves 20.3% and 22.9%
of the computations and 4.6% and 11.6% of the copy-detection time
on Book-CS and Book-full respectively. It does not make a difference
on the two Stock data sets, because there each pair of sources
shares a lot of data items.

We then examined various orders of processing entries in
the inverted index: Random processes the entries in random
order; ByProvider processes the entries in increasing order of
the number of providers (i.e., sources); and ByContribution
processes the entries in decreasing order of contribution
(proposed in this paper). Figure 3 shows the execution time of
each of the latter two relative to random ordering for
Bound and Hybrid. We observed that ByContribution
is the fastest among the three ordering schemes. When we
apply Bound, it improves over Random by 12% on average
and by 24% for Stock-1day, and over ByProvider
by 7% on average and by 22% for Stock-1day. When we
apply Hybrid, which skips many computations by setting up
a timer, the benefit of ByContribution is less evident but it
is still the fastest. We also note that although ByProvider is
better than Random, it may process some true but not widely
provided values towards the beginning and so can incur more
computation than ByContribution.
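
The three ordering schemes compared above can be sketched as follows. This is a minimal illustration only: the IndexEntry type, its field names, and the toy contribution scores are assumptions made for the example and are not taken from the paper's implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """One inverted-index entry (hypothetical layout for illustration)."""
    item: str            # data item the shared value belongs to
    value: str           # the shared value
    providers: tuple     # sources that provide this value
    contribution: float  # assumed precomputed score for sharing the value


def order_entries(entries, scheme, seed=0):
    """Return the entries in the processing order of the named scheme."""
    if scheme == "Random":
        shuffled = list(entries)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if scheme == "ByProvider":
        # Increasing order of the number of providers.
        return sorted(entries, key=lambda e: len(e.providers))
    if scheme == "ByContribution":
        # Decreasing order of contribution.
        return sorted(entries, key=lambda e: e.contribution, reverse=True)
    raise ValueError(f"unknown scheme: {scheme}")


# Toy entries: a popular value, a widely copied false value, and a true
# value that only two sources provide.
entries = [
    IndexEntry("book1.author", "J. Smith", ("S1", "S2", "S3", "S4"), 0.4),
    IndexEntry("book2.author", "Jon Smiht", ("S2", "S3", "S5"), 3.1),
    IndexEntry("book3.author", "A. Rare", ("S1", "S4"), 0.9),
]
by_provider = [e.item for e in order_entries(entries, "ByProvider")]
by_contribution = [e.item for e in order_entries(entries, "ByContribution")]
print(by_provider)
print(by_contribution)
```

In this toy input ByProvider starts with the rarely provided value even though its assumed contribution is modest, whereas ByContribution starts with the entry carrying the largest score.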

D. Incremental algorithms 

To understand how incremental detection improves efficiency,
we show in Table VIII the execution-time ratio of
Incremental versus Hybrid round by round. Indeed, incremental
detection saves execution time significantly: on average
it improves over Hybrid by 97% for indexing, 52% for copy
detection, and 93.5% in total. We also show in Table VIII
how many pairs terminate at each of the three passes. We
observe that 86% of pairs terminate in the first pass for
Book-full and over 98% of pairs do so for the other data sets.
This verifies our intuition and explains why Incremental can
save computation significantly.


Fig. 2. Single-round algorithms.

Fig. 3. Different index ordering.

E. Sampling

Finally, we compare our sampling strategy, called ScaleSample,
with sampling rate 10%, against two naive sampling
strategies as described in Sample1 and Sample2, which
we call ByItem and ByCell respectively; here we apply
Incremental on all samples. To ensure a fair comparison,
the sampling rate for ByItem is determined by the percentage of
data items sampled by ScaleSample, and the sampling rate
for ByCell (and Sample2) is determined by the percentage of
cells sampled by ScaleSample. For example, ScaleSample
sampled 49% of data items and 65% of cells on Book-CS, so
we applied a sampling rate of 49% for ByItem and 65%
for ByCell; ScaleSample sampled 10% of data items and
10% of cells on Stock-1day, so ByItem and ByCell applied
the same sampling rate (10%). Table IX shows the quality
of copy-detection results compared to applying Index. The
three sampling methods obtain the same results on Stock-1day
since the sources all have high coverage in that data
set; ScaleSample obtains the best results on Book-CS even
though it selects the same number of data items as ByItem
and the same number of cells as ByCell, since it guarantees
that we select at least N = 4 data items from each source
when possible.
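
To make the two naive baselines concrete, the sketch below samples a toy data set by item and by cell and then lists the sources left with fewer than N items. The data layout (a dictionary from source to its item/value pairs), the function names, and the toy sizes are all assumptions for illustration; ScaleSample itself is not reproduced here.

```python
import random


def sample_by_item(data, rate, rng):
    """ByItem-style sampling: keep a random subset of data items;
    each source keeps whatever values it provides on those items."""
    items = sorted({item for values in data.values() for item in values})
    k = max(1, round(rate * len(items)))
    kept = set(rng.sample(items, k))
    return {s: {i: v for i, v in values.items() if i in kept}
            for s, values in data.items()}


def sample_by_cell(data, rate, rng):
    """ByCell-style sampling: keep a random subset of (source, item) cells."""
    cells = sorted((s, i) for s, values in data.items() for i in values)
    k = max(1, round(rate * len(cells)))
    kept = set(rng.sample(cells, k))
    return {s: {i: v for i, v in values.items() if (s, i) in kept}
            for s, values in data.items()}


def sources_below(sample, original, n_min):
    """Sources that had at least n_min items originally but fewer in the sample."""
    return sorted(s for s in original
                  if len(original[s]) >= n_min and len(sample[s]) < n_min)


# Toy data: one high-coverage source and one low-coverage source.
rng = random.Random(7)
data = {
    "big": {f"item{i}": f"v{i}" for i in range(100)},
    "small": {f"item{i}": f"v{i}" for i in range(10)},
}
by_item = sample_by_item(data, 0.10, rng)
by_cell = sample_by_cell(data, 0.10, rng)
print(sources_below(by_item, data, 4), sources_below(by_cell, data, 4))
```

With a low-coverage source in the data, either naive strategy can leave that source with too few items to support a decision, which is the situation the per-source minimum of N items is meant to avoid.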

VII. Related Work 

Copy detection has been studied recently in (2), 0, 0, 
m , oa. Prior work has focused on effectiveness rather 
than efficiency of detection. As our experiments show, our
algorithms can improve efficiency over the state-of-the-art
algorithm (Pairwise) by three or more orders of magnitude
without sacrificing much quality.

Improving scalability of copy detection has been intensively 
studied for text documents and software programs (surveyed 
in QD). For documents, copy detection considers sharing 
sufficiently large text fragments as evidence of copying. The 
naive strategy looks for the longest common subsequences
(LCS), but can take time O(n_1 · n_2) for documents of sizes
n_1 and n_2 respectively, and needs to compare every pair of
documents. The first improvement is to build fingerprints for
each document and only selectively store and compare the
fingerprints. Manber fingerprints each sequence of Q
consecutive tokens (Q-gram), and builds a sketch with
Q-grams whose fingerprints are 0 mod K; the space usage is
thus only 1/K of the original documents. Brin et al. divide
each document into non-overlapping chunks, where the last
unit of each chunk has a fingerprint that is 0 mod K, and
sketch each chunk; again, the space usage is expected to be
1/K of the original documents. Schleimer et al. also fingerprint
each Q-gram, but the sketch contains the smallest fingerprint in
each K-window; it has the same space usage but is guaranteed
to find reuse of text with length of at least K + Q - 1.
Another improvement is to build an index for the sketches, 
such that two documents are compared only if they share some 
fingerprints DU. 
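
The two Q-gram sketching schemes described above can be illustrated in a few lines. This is a simplified sketch: zlib.crc32 stands in for whatever fingerprint function the cited systems use, and the function names and toy documents are assumptions for the example.

```python
import zlib


def qgram_fingerprints(tokens, q):
    """Fingerprint every sequence of q consecutive tokens (Q-gram).
    crc32 is an assumed stand-in for the real fingerprint function."""
    return [zlib.crc32(" ".join(tokens[i:i + q]).encode("utf-8"))
            for i in range(len(tokens) - q + 1)]


def sketch_mod_k(fps, k):
    """Keep only fingerprints that are 0 mod k (about 1/k of them in expectation)."""
    return {fp for fp in fps if fp % k == 0}


def sketch_window_min(fps, k):
    """Keep the smallest fingerprint in each window of k consecutive fingerprints."""
    return {min(fps[i:i + k]) for i in range(len(fps) - k + 1)}


Q, K = 3, 4
# A shared passage of exactly K + Q - 1 tokens yields K consecutive Q-grams,
# i.e. one full window that appears in both documents.
shared = [f"s{i}" for i in range(K + Q - 1)]
doc_a = [f"a{i}" for i in range(10)] + shared + [f"a{i}" for i in range(10, 20)]
doc_b = [f"b{i}" for i in range(10)] + shared + [f"b{i}" for i in range(10, 20)]

fps_a = qgram_fingerprints(doc_a, Q)
fps_b = qgram_fingerprints(doc_b, Q)
common = sketch_window_min(fps_a, K) & sketch_window_min(fps_b, K)
print(len(common) >= 1)
```

Because the shared passage contributes an identical window of K fingerprints to both documents, the minimum of that window is selected in both sketches, which is why reuse of length at least K + Q - 1 cannot be missed by the window-based scheme. The 0 mod K scheme gives no such guarantee, since none of the shared fingerprints may happen to be divisible by K.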

We also build an inverted index for the provided values and
skip pairs of sources that do not share any value; however, our
index is different in many ways. First, each entry in the index
is associated with a score, indicating how strongly sharing the
value serves as evidence for copying. Second, the entries
are processed in decreasing order of the scores, so we consider
stronger evidence first and can stop computation for a pair
of sources once we have accumulated sufficient evidence for
deciding copying or no-copying. Third, source pairs that share
only a few entries with small scores are also skipped for
copy detection. Finally, we additionally design algorithms for
pruning and incremental copy detection, which have not been
discussed for document copy detection.
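
The first two differences can be illustrated with a deliberately simplified sketch: entries carry a score, are processed in decreasing score order, and a pair of sources is no longer examined once its accumulated evidence reaches a threshold. The function name, the additive accumulation, the single threshold, and the toy scores are assumptions for illustration; the no-copying decision and the pruning bounds used by the actual algorithms are omitted.

```python
from itertools import combinations


def detect_copying(index, threshold):
    """index: list of (score, providers) entries.
    Returns (pairs judged as copying, number of score updates performed)."""
    accumulated = {}   # pair -> evidence accumulated so far
    decided = set()    # pairs that already reached the threshold
    computations = 0
    # Strongest evidence first.
    for score, providers in sorted(index, key=lambda e: e[0], reverse=True):
        for pair in combinations(sorted(providers), 2):
            if pair in decided:
                continue  # sufficient evidence already: skip this pair
            accumulated[pair] = accumulated.get(pair, 0.0) + score
            computations += 1
            if accumulated[pair] >= threshold:
                decided.add(pair)
    return decided, computations


# Toy index: pairs that share no value never appear in any entry,
# so they are never considered at all.
index = [
    (0.2, ["S1", "S2", "S3", "S4"]),
    (3.0, ["S1", "S2"]),
    (0.1, ["S3", "S4"]),
    (2.5, ["S1", "S2", "S3"]),
]
copying, computations = detect_copying(index, threshold=5.0)
print(copying, computations)
```

In this toy run the pair (S1, S2) reaches the threshold after the two highest-scoring entries and is skipped when the low-scoring entry is processed; with more entries per pair, the savings from stopping early grow accordingly.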

VIII. Conclusions 

Copy detection has been shown to be crucial for truth
finding on Web data, but it is also a bottleneck in data
fusion. This paper proposed various methods for improving
scalability of copy detection on structured data. Experimental
results show that the proposed algorithm can reduce
copy-detection time by several orders of magnitude and finish
quickly on large data sets.

Our algorithms provide two opportunities for parallelization
in a Hadoop framework. First, when we process each index
entry, we can parallelize score computation for each pair of
sources in that entry. Second, we can parallelize computation
among entries: whereas parallelizing on all entries would be
hard given the possibly huge number of entries, Bound+
provides good insights on which entries can be processed
in parallel. Both approaches are likely to be better than the
strategy that simply extends Pairwise by parallelizing copy
detection for each pair of sources, as the total number of pairs
can be huge for big data. We leave such extensions and an
experimental comparison for future work.